Commit graph

2527 commits

Author SHA1 Message Date
Mincho Paskalev
b5a37c0e42
Add cmd tips for HOTKEYS. Return err when hotkeys START specifies invalid slots (#14761)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
- When slots outside the range owned by a node are passed to `HOTKEYS START
SLOTS ...`, the command now returns an error.
- Changed the cmd tips for the HOTKEYS subcommands so that they reflect
the special nature of the command in cluster mode, i.e. it should be
issued against a single node only. Clients should not be concerned with
cluster management or aggregation of results.
- Changed the reply schema to return an array of maps. For a single node
this returns an array of one element; results fetched from multiple nodes
can then easily be concatenated into one array.
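
Since each node now replies with an array of maps, aggregating across nodes reduces to list concatenation. A minimal client-side sketch (hypothetical helper name, illustrative reply contents):

```python
# Sketch, not Redis source: with each node replying an array of maps,
# a client-side aggregator can simply concatenate the arrays.

def merge_node_replies(replies):
    """Each element of `replies` is one node's array-of-maps reply."""
    merged = []
    for reply in replies:
        merged.extend(reply)
    return merged

node_a = [{"tracking-active": 1, "sample-ratio": 1}]
node_b = [{"tracking-active": 1, "sample-ratio": 10}]
combined = merge_node_replies([node_a, node_b])
print(len(combined))  # 2
```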
2026-02-03 17:54:32 +02:00
Slavomir Kaslev
bafaec5b6a
Fix HOTKEYS to track each command in a MULTI/EXEC block (#14756)
Fix HOTKEYS to track each command in a MULTI/EXEC block.
2026-02-02 09:50:44 +02:00
Slavomir Kaslev
ca681f997e
Add LTRIM/LREM and RM_StringTruncate() memory tracking tests (#14751)
Add LTRIM/LREM and RM_StringTruncate() memory tracking tests.
2026-01-29 13:04:46 +02:00
Mincho Paskalev
591fc90263
Change reply schema for hotkeys get to use map instead of flat array (#14749)
Follow #14680
The reply of `HOTKEYS GET` is an unordered collection of key-value pairs,
so it is more reasonable for it to be a map in RESP3 instead of a flat array.
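
For RESP2 clients that still receive the flat array shape, conversion to the map shape is mechanical. A small sketch, with illustrative field names:

```python
# Sketch: converting a RESP2-style flat [k1, v1, k2, v2, ...] reply into
# the map shape the RESP3 schema now uses. Field names are illustrative.

def pairs_to_map(flat):
    if len(flat) % 2 != 0:
        raise ValueError("flat reply must contain key-value pairs")
    return dict(zip(flat[0::2], flat[1::2]))

flat_reply = ["tracking-active", 1, "sample-ratio", 10]
print(pairs_to_map(flat_reply))  # {'tracking-active': 1, 'sample-ratio': 10}
```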
2026-01-29 11:21:05 +02:00
debing.sun
beb75e40bf
Fix test failure when using bind "*" in introspection.tcl (#14745)
The reason for the failure is that when the server is started with bind *,
the host is set to *; when reconnecting, the client does not recognize
this host.
So this fix skips checking whether the server is ready.
2026-01-27 20:50:06 +08:00
Mincho Paskalev
b209e8afde
Fix hotkey info metric names. Disable HOTKEY SLOTS param for non-cluster (#14742)
Some hotkeys CPU metrics display time in milliseconds, others in
microseconds.

Change the metrics showing command execution time to consistently use
microseconds, with the `-us` suffix to indicate that.

Also, disable the `SLOTS` param for `HOTKEYS START` if we are not in
cluster mode.
2026-01-26 13:32:05 +02:00
Stav-Levi
a765ee8238
Add security configuration warnings at startup (#14708)
Adds startup-time security warnings when the default user permits
unauthenticated access, with behavior dependent on protected-mode and
bind settings.
Warnings are skipped in Sentinel mode since it disables protected-mode
by design.

- No password + no protected-mode + no bind: warn about accepting
  connections from any IP/interface
- No password + no protected-mode: warn about accepting connections
  from any IP on configured interface
- No password + protected-mode enabled: warn about accepting
  connections from local clients
2026-01-26 16:58:53 +08:00
debing.sun
18538461d1
Add separate statistics for active expiration of keys and hash fields (#14727)
### Summary

Adds `expired_keys_active` and `expired_subkeys_active` counters to
track keys and hash fields expired by the active expiration cycle,
distinguishing them from lazy expirations.
These new metrics are exposed in INFO stats output.

### Motivation

Currently, Redis tracks the total number of expired keys (expired_keys)
and expired hash fields (expired_subkeys), but there is no way to
differentiate expirations triggered by the active expire cycle from lazy
expirations.
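
The relationship between the counters can be modeled as follows; this is an illustrative sketch with invented names, not the server's internal accounting:

```python
# Illustrative model (not Redis internals) of the split counters: every
# expiration bumps the total, and active-cycle expirations additionally
# bump the new *_active counter, so lazy = total - active.

from dataclasses import dataclass

@dataclass
class ExpireStats:
    expired_keys: int = 0
    expired_keys_active: int = 0

    def on_key_expired(self, active_cycle: bool):
        self.expired_keys += 1
        if active_cycle:
            self.expired_keys_active += 1

    @property
    def expired_keys_lazy(self):
        return self.expired_keys - self.expired_keys_active

stats = ExpireStats()
stats.on_key_expired(active_cycle=True)   # expired by the active cycle
stats.on_key_expired(active_cycle=False)  # lazily expired on access
print(stats.expired_keys, stats.expired_keys_active, stats.expired_keys_lazy)
# 2 1 1
```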

---------

Co-authored-by: Moti Cohen <moti.cohen@redis.com>
2026-01-22 22:30:25 +08:00
Slavomir Kaslev
5dec7d3675
Add key allocation sizes histogram (#14695)
Add key allocation sizes histograms based on previous memory accounting work
in #14363 and #14451.

The histograms are exposed via `INFO keysizes` and use logarithmic (power-of-2) bins,
similar to the current key sizes/lengths histogram implementation, in the following fields:

    db0_distrib_lists_sizes:1=...,2=...,4=...
    db0_distrib_sets_sizes:1=...,2=...,4=...
    db0_distrib_hashes_sizes:1=...,2=...,4=...
    db0_distrib_zsets_sizes:1=...,2=...,4=...

To avoid confusion with the existing distrib_strings_sizes histograms, which are based
on string lengths, we don't report allocation size histograms for strings.

So far, per-key and per-slot memory accounting code has relied on type-specific functions
(hashTypeAllocSize(), listTypeAllocSize(), zsetAllocSize(), etc.) to compute data structure
allocation sizes, since they are faster and we only need to track size deltas, not the
complete allocation size along with the kvobj and key length overhead. To keep the allocation
sizes histogram consistent, the memory accounting code has been switched to kvobjAllocSize(),
which does return the total allocation size.

Note that the feature is enabled with the `key-bytes-stats` or `cluster-slot-stats` config
in the redis config file at startup.
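
The power-of-2 binning can be sketched as follows; `histogram_bin` is a hypothetical helper, and the bin-label convention (largest power of two not exceeding the size) is an assumption, not taken from the Redis source:

```python
# Sketch of logarithmic (power-of-2) binning, as an illustration of the
# bin labels (1=...,2=...,4=...) shown above. The helper name and the
# exact labeling convention are assumptions.

from collections import Counter

def histogram_bin(alloc_size: int) -> int:
    """Largest power of two <= alloc_size (assumed bin convention)."""
    assert alloc_size >= 1
    return 1 << (alloc_size.bit_length() - 1)

sizes = [1, 2, 3, 4, 5, 8, 9, 1024]
hist = Counter(histogram_bin(s) for s in sizes)
print(sorted(hist.items()))  # [(1, 1), (2, 2), (4, 2), (8, 2), (1024, 1)]
```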
2026-01-22 09:40:04 +02:00
Paulo Sousa
c4baa64ea8
Optimize peak memory stats by switching from per-command checks to threshold-based (#14692)
This PR optimizes peak memory tracking by moving from **per-command
checks** to a **threshold-based mechanism** in `zmalloc`.

Instead of updating peak memory on every command, peak tracking is now
triggered only when a thread's memory delta exceeds **100KB**. This
reduces runtime overhead while keeping peak memory accuracy acceptable.

## Implementation Details

- Peak memory is tracked atomically in `zmalloc` when a thread's memory
delta exceeds 100KB
- Thread-safe peak updates using CAS
- Peak tracking considers both:
  - current used memory
  - zmalloc-reported peak memory
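
The threshold-plus-CAS idea can be sketched as below. This is an illustrative Python model with invented names; a lock stands in for the hardware compare-and-swap that the C code would use:

```python
# Illustrative model of threshold-based peak tracking: each thread keeps
# a local delta and only touches shared state once the delta exceeds the
# 100KB threshold, publishing the peak with a CAS-style retry loop.
# All names here are invented, not zmalloc's actual symbols.

import threading

THRESHOLD = 100 * 1024

class PeakTracker:
    def __init__(self):
        self._lock = threading.Lock()   # stands in for hardware CAS
        self.used = 0
        self.peak = 0

    def _compare_and_swap_peak(self, expected, new):
        with self._lock:
            if self.peak == expected:
                self.peak = new
                return True
            return False

    def flush_delta(self, delta):
        """Publish a thread-local delta, then raise the peak if needed."""
        with self._lock:
            self.used += delta
            snapshot = self.used
        while True:                      # CAS retry loop
            current = self.peak
            if snapshot <= current:
                return
            if self._compare_and_swap_peak(current, snapshot):
                return

class ThreadLocalAccounting:
    def __init__(self, shared):
        self.shared = shared
        self.delta = 0

    def on_alloc(self, n):
        self.delta += n
        if abs(self.delta) >= THRESHOLD:  # threshold check, not per-command
            self.shared.flush_delta(self.delta)
            self.delta = 0

shared = PeakTracker()
local = ThreadLocalAccounting(shared)
for _ in range(300):
    local.on_alloc(1024)                  # 300KB allocated in 1KB steps
print(shared.peak)                        # peak published in 100KB batches
```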

## Performance Results (ARM AArch64)

All performance numbers were obtained on an **AWS m8g.metal (ARM
AArch64)** instance.

The database was pre-populated with **1M keys**, each holding a **1KB
value**.
Benchmarks were executed using memtier with a **10 SET : 90 GET ratio**
and **pipeline = 10** ([full benchmark spec.
here](https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-1Mkeys-string-setget200c-1KiB-pipeline-10.yml)).

| Environment | Baseline `redis/redis` unstable (median ± std.dev) | Comparison `paulorsousa/redis` `f05a4bd273cb4d63ff03d33e6207837b6e51de86` (median) | % change (higher better) | Note |
|------------------------------|---------------------------------:|----------:|-------:|-----------------------|
| oss-standalone | 802,830 ± 0.2% (7 datapoints) | 796,660 | -0.8% | No change |
| oss-standalone-02-io-threads | 982,698 ± 0.6% (7 datapoints) | 980,520 | -0.2% | No change |
| oss-standalone-04-io-threads | 2,573,244 ± 1.9% (7 datapoints) | 2,630,931 | +2.2% | Potential improvement |
| oss-standalone-08-io-threads | 2,343,609 ± 1.6% (7 datapoints) | 2,455,630 | +4.8% | Improvement |
2026-01-21 22:52:31 +08:00
Mincho Paskalev
e3c38aab66
Handle primary/replica clients in IO threads (#14335)
# Problem

While introducing async IO threads
(https://github.com/redis/redis/pull/13695), primary and replica
clients were left to be handled in the main thread due to data races and
synchronization issues. This PR solves that issue, with the additional
hope that it increases replication performance.

# Overview

## Moving the clients to IO threads

Since clients first participate in the handshake and RDB replication
phases, it was decided to move them to an IO thread after RDB replication
is done. For the primary client this was trivial, as the master client is
created only after RDB sync (plus some additional checks, see
`isClientMustHandledByMainThread`). Replica clients, though, are moved to
IO threads immediately after connection (as are all clients), so
currently in `unstable` replication happens while this client is in an
IO thread. In this PR it is moved to the main thread after receiving the
first `REPLCONF` message from the replica, but this is a bit hacky and we
can remove it. I didn't find issues between the two versions.

## Primary client (replica node)

We have a few issues here:
- during `serverCron`, `replicationCron` is run, which periodically
sends a `REPLCONF ACK` message to the master and also checks for a
timed-out master. To prevent data races we utilize `IOThreadClientsCron`.
The client is periodically sent to the main thread, and during
`processClientsFromIOThread` it is checked whether it needs to run the
replication cron behaviour.

- data races with the main thread: specifically, the `lastinteraction`
and `read_reploff` members of the primary client, which are written in
`readQueryFromClient`, could be accessed at the same time from the main
thread during execution of `INFO REPLICATION` (`genRedisInfoString`). To
solve this the members were duplicated: if the client is in an IO thread
it writes to the duplicates, and they are synced with the original
variables each time the client is sent to the main thread (meaning `INFO
REPLICATION` could potentially return stale values).

- during `freeClient` the primary client is fetched to the main thread,
but when caching it (`replicationCacheMaster`) the thread id remains the
id of the IO thread it came from. This creates problems when resurrecting
the master client. Here the call to `unbindClientFromIOThreadEventLoop`
in `freeClient` was rewritten to call `keepClientInMainThread`, which
automatically fixes the problem.

- during `exitScriptTimedoutMode` the master is queued for reprocessing
(specifically, to process any pending commands ASAP after it's unblocked).
We do that by putting it in the `server.unblocked_clients` list, which is
processed in the next `beforeSleep` cycle in the main thread. Since this
would create contention between the main and IO threads, we skip this
queueing in `unblocked_clients` and just queue the client to the main
thread; `processClientsFromIOThread` will process the pending commands
just as the main thread would have.

## Replica clients (primary node)

We move the client after RDB replication is done and after the
replication backlog is fed its first message.
We do that so the client's reference to the first replication backlog
node is initialized before it's read from an IO thread, hence no
contention with the main thread on it.

### Shared replication buffer

Currently in unstable the replication buffer is shared amongst clients.
This is done via clients holding references to the nodes inside the
buffer. A node from the buffer can be trimmed once each replica client
has read it and sent its contents. The reference is
`client->ref_repl_buf_node`. The replication buffer is written to by the
main thread in `feedReplicationBuffer`, and the refcounting is intrusive:
it lives inside the replication-buffer nodes themselves.

Since the replica client changes the refcount (decreasing the refcount of
the node it has just read and increasing the refcount of the next node
it starts to read) during `writeToClient`, we have a data race with the
main thread when it feeds the replication buffer. Moreover, the main
thread also updates the `used` size of the node (how much has been
written to it, compared to its capacity), which the replica client relies
on to know how much to read. Obviously the replica being in an IO thread
creates another data race here. To mitigate these issues a few new
variables were added to the client struct:

- `io_curr_repl_node` - the starting node this replica is reading from
inside the IO thread
- `io_bound_repl_node` - the last node in the replication buffer the
replica sees before being sent to the IO thread.

These values are only allowed to be updated in the main thread. The
client keeps track of how much it has read into the buffer via the old
`ref_repl_buf_node`. Generally, while in an IO thread the replica client
will now keep a refcount on `io_curr_repl_node` until it has processed
all the nodes up to `io_bound_repl_node`; at that point it is returned to
the main thread, which can safely update the refcounts.
The `io_bound_repl_node` reference is there so the replica knows when to
stop reading from the repl buffer: imagine the replica reads from the
last node of the replication buffer while the main thread feeds data to
it; we would create a data race on the `used` value
(`_writeToClientSlave` (IO thread) vs `feedReplicationBuffer` (main)).
That's why this value is updated just before the replica is sent to an
IO thread.
*NOTE*: this means that when replicas are handled by IO threads they
will hold more than one node at a time (i.e. `io_curr_repl_node` up to
`io_bound_repl_node`), so trimming will happen a bit less frequently.
Tests show no significant problems with that.
(Thanks to @ShooterIT for the `io_curr_repl_node` and `io_bound_repl_node`
mechanism; my initial implementation had similar semantics but was way
less clear.)

Example of how this works:

* Replication buffer state at time N:
   | node 0 | ... | node M, used_size K |
* The replica caches `io_curr_repl_node`=0, `io_bound_repl_node`=M and
`io_bound_block_pos`=K
* The replica moves to an IO thread and processes all the data it sees
* Replication buffer state at time N + 1:
| node 0 | ... | node M, used_size Full | | node M + 1 | | node M + 2,
used_size L |, where Full > K
* The replica moves to the main thread at time N + 1; at this point the
following happens:
   - the refcount of node 0 (io_curr_repl_node) is decreased
- `ref_repl_buf_node` becomes node M (io_bound_repl_node) (we still have
Full - K bytes to process from there)
- the refcount of node M is increased (now all nodes from 0 up to M-1
inclusive can be trimmed unless some other replica holds a reference to
them)
- and just before the replica is sent back to the IO thread the
following are updated:
   - `io_bound_repl_node` becomes node M+2
   - `io_bound_block_pos` becomes L

Note that the replica client is only moved to main if it has processed
all the data it knows about (i.e. up to `io_bound_repl_node` +
`io_bound_block_pos`)
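
The window mechanism above can be illustrated with a toy model (invented names, simplified to one replica and byte counts only):

```python
# Toy model of the io_curr/io_bound window: the IO thread reads only up
# to the bound captured by the main thread, so the main thread can keep
# appending without a race. Names and structure are illustrative only.

class Node:
    def __init__(self, used):
        self.used = used          # bytes written by the main thread
        self.refcount = 0

buffer = [Node(100), Node(100), Node(40)]   # last node has used_size K=40

# Main thread: capture the window before handing the replica to IO.
io_curr, io_bound, bound_pos = 0, 2, buffer[2].used
buffer[io_curr].refcount += 1               # hold only the starting node

# IO thread: read everything up to the captured bound, nothing beyond,
# even if the main thread grows the last node or appends new nodes.
read = sum(buffer[i].used for i in range(io_curr, io_bound)) + bound_pos
buffer[2].used = 100                        # main thread keeps feeding
buffer.append(Node(70))

# Back in the main thread: move the refcount forward so nodes before the
# bound can be trimmed, then a fresh window would be captured.
buffer[io_curr].refcount -= 1
buffer[io_bound].refcount += 1
print(read)                                 # 240 bytes sent to the replica
```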

### Replica clients kept in main as much as possible

During implementation an issue arose: how fast can the replica client
learn about new data in the replication buffer, and how fast can it trim
it? For that to happen ASAP, whenever a replica is moved to the main
thread it remains there until the replication buffer is fed new data. At
that point it's put in the pending write queue and special-cased in
handleClientsWithPendingWrites so that it's sent to an IO thread ASAP to
write the new data to the replica. Also, since each time the replica
writes all the repl data it knows about, after it's sent to the main
thread `processClientsFromIOThread` can immediately update the refcounts
and trim whatever it can.

### ACK messages from replicas

The primary needs to periodically read the `REPLCONF ACK` messages sent
by its replica clients. Since a replica can remain in the main thread
indefinitely if no DB change occurs, a new atomic `pending_read` flag was
added, set during `readQueryFromClient`. If a replica client has a
pending read, it is returned to the IO thread to process the read even if
there is no pending repl data to write.

### Replicas during shutdown

During shutdown the main thread pauses write actions and periodically
checks whether all replicas have reached the same replication offset as
the primary node. During `finishShutdown` that may or may not be the
case. Either way, client data may be read from the replicas, and we may
even try to write any pending data to them inside
`flushSlavesOutputBuffers`. To prevent races, all replicas are moved from
IO threads to the main thread via `fetchClientFromIOThread`. Cancelling
the shutdown should be ok, since the mechanism employed by
`handleClientsWithPendingWrites` should return the client to an IO thread
when needed.

## Notes

While adding new tests, timing issues with TSan tests were found and
fixed.

There is also a data race caught by TSan on the `last_error` member of
the `client` struct. It happens when both an IO thread and the main
thread make a syscall using the same `client` instance; this can happen
only for primary and replica clients, since their data can be accessed by
commands sent from other clients. A specific example is the `INFO
REPLICATION` command.
Although other such races were fixed, as described above, this one is
insignificant and it was decided to suppress it in `tsan.sup`.

---------

Co-authored-by: Yuan Wang <wangyuancode@163.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
2026-01-21 16:19:12 +02:00
Slavomir Kaslev
b9c00b27f8
Make cluster-slot-stats-enabled config multivalued (#14719)
This allows users to specify exactly what per slot statistics are to be
collected -- CPU, network traffic and/or memory used.

The config accepts multiple values as a space-separated list:
  - cpu: Track CPU usage per slot (cpu-usec metric)
  - net: Track network bytes per slot (network-bytes-in, network-bytes-out metrics)
  - mem: Track memory usage per slot (memory-bytes metric)
  - yes: Enable all tracking (equivalent to "cpu net mem")
  - no: Disable all tracking (default)

Note: Memory tracking (mem) can ONLY be enabled at startup. If you try to enable
memory tracking via CONFIG SET when it wasn't enabled at startup, the command will
fail. However, you can disable memory tracking at runtime by removing the 'mem' flag.
Once disabled, memory tracking cannot be re-enabled without restarting the server.
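
The parsing and the mem-at-startup constraint can be sketched as follows (hypothetical function names; the real config code lives in C):

```python
# Sketch of the multivalued cluster-slot-stats-enabled semantics
# described above. Function names are invented for illustration.

ALL_FLAGS = {"cpu", "net", "mem"}

def parse_slot_stats(value):
    tokens = value.split()
    if tokens == ["yes"]:
        return set(ALL_FLAGS)           # shorthand for "cpu net mem"
    if tokens == ["no"] or not tokens:
        return set()
    flags = set(tokens)
    if not flags <= ALL_FLAGS:
        raise ValueError("unknown flags: %s" % (flags - ALL_FLAGS))
    return flags

def config_set(startup_flags, new_value):
    new_flags = parse_slot_stats(new_value)
    # mem can only be enabled at startup; it may be removed at runtime
    # but never re-added afterwards.
    if "mem" in new_flags and "mem" not in startup_flags:
        raise ValueError("mem tracking can only be enabled at startup")
    return new_flags

print(parse_slot_stats("yes") == {"cpu", "net", "mem"})   # True
print(sorted(config_set({"cpu", "net", "mem"}, "cpu net")))  # ['cpu', 'net']
```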
2026-01-21 15:36:03 +02:00
Yuan Wang
a2e901c93d
Fix inaccurate IO thread client count due to delayed freeing (#14723)
There is a failure in CI:
```
*** [err]: Clients are evenly distributed among io threads in tests/unit/introspection.tcl
Expected '2' to be equal to '1' (context: type eval line 3 cmd {assert_equal $cur_clients 1} proc ::start_server)
```

There might be a client used for health checks (to detect whether the
server is up) that has not been freed in time. This can lead to an
inaccurate count of connected clients processed by IO threads, so we wait
for it to close completely.
2026-01-21 18:13:40 +08:00
Stav-Levi
25f780b662
Fix crash when calling internal container command without arguments (#14690)
Addresses a crash and clarifies errors around container commands.

- Update server.c to handle container commands with no subcommand: emit
"missing subcommand. Try HELP."; keep "unknown subcommand" for invalid
subcommands; for unknown commands, include args preview only when
present
- Add a test module command subcommands.internal_container with a
subcommand for validation
- Add unit test asserting missing subcommand error when calling the
internal container command without arguments
2026-01-21 08:38:04 +02:00
debing.sun
e76e3af5b7
Fix some test timing issues in replication.tcl and maxmemory.tcl (#14718)
1) Replace fixed sleep with wait_for_condition to avoid flaky test
failures when checking master_current_sync_attempts counter.

2) Similar to https://github.com/redis/redis/pull/14674, use
assert_lessthan_equal instead of assert_lessthan to verify the idle
time.
2026-01-20 19:25:15 +08:00
debing.sun
d2da5cca37
Fix timeout waiting for blocked clients in pause test (#14716)
To verify the pause duration, we need to wait for the client to be
unpaused and the command to complete, so add `$rd read` to wait for the
command to finish.

The test failure was caused by $rd still being blocked and not closed in
the previous test, so the next test would see 2 blocked clients instead
of 1, causing it to fail.
2026-01-20 17:12:22 +08:00
Yuan Wang
cfa6129040
Minor fixes for ASM (#14707)
- **TCL test failure**

https://github.com/redis/redis/actions/runs/21121021310/job/60733781853#step:6:5705
```
[err]: Test cluster module notifications when replica restart with RDB during importing
in tests/unit/cluster/atomic-slot-migration.tcl
Expected '{sub: cluster-slot-migration-import-started, source_node_id:28c64b3f462f3c29aa3c96c2ba5dff948dfe315b, destination_node_id:1382a4b4ca86621e39068ee8b25524a44a21bbc1, task_id:4d185a5398be94edac0dd77fff094eb7f5c73ec4, slots:0-100}' to be equal to '{sub: cluster-slot-migration-import-started, source_node_id:28c64b3f462f3c29aa3c96c2ba5dff948dfe315b, destination_node_id:1382a4b4ca86621e39068ee8b25524a44a21bbc1, task_id:4d185a5398be94edac0dd77fff094eb7f5c73ec4, slots:0-100} {sub: cluster-slot-migration-import-completed, source_node_id:28c64b3f462f3c29aa3c96c2ba5dff948dfe315b, destination_node_id:1382a4b4ca86621e39068ee8b25524a44a21bbc1, task_id:4d185a5398be94edac0dd77fff094eb7f5c73ec4, slots:0-100}' (context: type eval line 29 cmd {assert_equal  [list  "sub: cluster-slot-migration-import-started, source_node_id:$src_id, destination_node_id:$dest_id, task_id:$task_id, slots:0-100"  ] [R 4 asm.get_cluster_event_log]} proc ::test)
```
If there is a delay before the check runs, the ASM task may already have
completed, so we would get both `started` and `completed` ASM log entries
instead of only `started`. The check feels fragile, so delete it; we
check all logs later anyway.
```
                restart_server -4 true false true save ;# rdb save
---> if there is a delay, the ASM task should complete
                # the asm task info in rdb will fire module event
                assert_equal  [list \
                    "sub: cluster-slot-migration-import-started, source_node_id:$src_id, destination_node_id:$dest_id, task_id:$task_id, slots:0-100" \
                ] [R 4 asm.get_cluster_event_log]
```
- **Start BGSAVE for slot snapshot ASAP**
Since we consider the migrating client a replica that wants diskless
replication, it will wait for `repl-diskless-sync-delay` to start a new
fork after the last child exits. But a slot snapshot cannot actually be
shared with other slaves, so we can start BGSAVE for it immediately.

  Also resolves internal ticket RED-177974.
2026-01-19 19:57:20 +08:00
debing.sun
39881fa6f2
Reply Copy Avoidance (#14608)
This PR is based on https://github.com/valkey-io/valkey/pull/2078

# Reply Copy Avoidance Optimization

This PR introduces an optimization to avoid unnecessary memory copies
when sending replies to clients in Redis.

## Overview

Currently, Redis copies reply data into client output buffers before
sending responses. This PR implements a mechanism to avoid these copies
in certain scenarios, improving performance and reducing memory
overhead.

### Key Changes
* Added a capability to reply construction that allows interleaving
regular replies with copy-avoid replies in client reply buffers
* Extended write-to-client handlers to support copy-avoid replies
* Added copy avoidance of string bulk replies when copy avoidance is
indicated by I/O threads
* Copy avoidance is beneficial for performance, regardless of object
size, only from a certain number of threads upward, so it is enabled only
once that thread count is reached

**Note**: when copy avoidance is disabled, the content and handling of
client reply buffers remain as before this PR

---------

Signed-off-by: Alexander Shabanov <alexander.shabanov@gmail.com>
Signed-off-by: xbasel <103044017+xbasel@users.noreply.github.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Alexander Shabanov <alexander.shabanov@gmail.com>
Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Slavomir Kaslev <slavomir.kaslev@gmail.com>
Co-authored-by: moticless <moticless@github.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
2026-01-19 11:09:16 +08:00
Filipe Oliveira (Redis)
7f541b9607
Prefetch client fields before prefetching command-related data (#14700)
This PR refines the prefetch strategy by removing ineffective
dictionary-level prefetching (issued too close to its use in the
pipeline) and improving prefetch usage in IO threads. The goal is to
better align prefetches with predictable access patterns.

## Changes

- Removed speculative prefetching from `dictFindLinkInternal()`,
simplifying the dictionary lookup hot path.
- Introduced a two-phase prefetch approach in
`prefetchIOThreadCommands()`:
  - Phase 1: Prefetch client structures and `pending_cmds`
- Phase 2: Add commands to the batch and prefetch follow-up fields
(`reply`, `mem_usage_bucket`)

## Performance

Measured with
`memtier_benchmark-1Mkeys-string-setget2000c-1KiB-pipeline-16`.

| Environment                  | % change |
|-----------------------------|----------|
| oss-standalone               | -0.1%    |
| oss-standalone-02-io-threads | +0.4%    |
| oss-standalone-04-io-threads | +1.6%    |
| oss-standalone-08-io-threads | +2.3%    |
| oss-standalone-12-io-threads | +0.7%    |
| oss-standalone-16-io-threads | +1.9%    |

Overall, this shows an ~2% throughput improvement on IO-threaded
configurations, with no meaningful impact on non-IO-threaded setups.

---------

Co-authored-by: Yuan Wang <wangyuancode@163.com>
2026-01-18 20:14:39 +08:00
Mincho Paskalev
c93e4a62c6
Add hotkeys detection (#14680)
# Description

Introducing a new method for identifying hotkeys inside a Redis server
during a tracking time period.

Hotkeys in this context are defined by two metrics:
* Percentage of time spent by the CPU on the key out of the total time
during the tracking period
* Percentage of network bytes (input+output) used for the key out of the
total network bytes used by Redis during the tracking period

## Usage

Although the API is subject to change, the general idea is for the user
to initiate a hotkeys tracking process that runs for some time. The keys'
metrics are recorded in a probabilistic structure, after which the user
can fetch the top K of them.
### Current API

```
HOTKEYS START
            <METRICS count [CPU] [NET]>
            [COUNT k] 
            [DURATION duration]
            [SAMPLE ratio]
            [SLOTS count slot…]

HOTKEYS GET
HOTKEYS STOP
HOTKEYS RESET

```

### HOTKEYS START

Start a tracking session if none is already started, or if the previous
one was stopped or reset. Returns an error if one is in progress.

* METRICS count [CPU] [NET] - choose one or more metrics to track
* COUNT k - track top K keys
* DURATION duration - preset how long the tracking session should last
* SAMPLE ratio - a key is tracked with probability 1/ratio
* SLOTS count slot... - only track a key if it belongs to one of the
chosen slots

### HOTKEYS GET

Returns an array of the metrics chosen for tracking and various other
metadata. Returns (nil) if no tracking was started or it was reset.

```
127.0.0.1:6379> hotkeys get
1) "tracking-active"
2) 1
3) "sample-ratio"
4) <ratio>
5) "selected-slots" (empty array if no slots selected)
6) 1) 0
   2) 5
   3) 6
7) "sampled-command-selected-slots-ms" (show on condition sample-ratio > 1 and selected-slots != empty-array)
8) <time-in-milliseconds>
9) "all-commands-selected-slots-ms" (show on condition selected-slots != empty-array)
10) <time-in-milliseconds>
11) "all-commands-all-slots-ms"
12) <time-in-milliseconds>
13) "net-bytes-sampled-commands-selected-slots" (show on condition sample-ratio > 1 and selected-slots != empty-array)
14) <num-bytes>
15) "net-bytes-all-commands-selected-slots" (show on condition selected-slots != empty-array)
16) <num-bytes>
17) "net-bytes-all-commands-all-slots"
18) <num-bytes>
19) "collection-start-time-unix-ms"
20) <start-time-unix-timestamp-in-ms>
21) "collection-duration-ms"
22) <duration-in-milliseconds>
23) "used-cpu-sys-ms"
24) <duration-in-millisec>
25) "used-cpu-user-ms"
26) <duration-in-millisec>
27) "total-net-bytes"
28) <num-bytes>
29) "by-cpu-time"
30) 1) key-1_1
    2) <millisec>
    ...
    19) key-10_1
    20) <millisec>
31) "by-net-bytes"
32) 1) key-1_2
    2) <num-bytes>
    ...
    19) key-10_2
    20) <num-bytes>

```

### HOTKEYS STOP

Stop the tracking session; the user can still get results from `HOTKEYS GET`.

### HOTKEYS RESET

Releases the resources used for hotkeys tracking; allowed only when
tracking is stopped. Returns an error if tracking is active.

## Additional changes

The `INFO` command now has a "hotkeys" section with 3 fields:
* tracking_active - a boolean flag indicating whether or not we
currently track hotkeys.
* used-memory - memory overhead of the structures used for hotkeys
tracking.
* cpu-time - time in ms spent updating the hotkey structure.

## Implementation

Independent of the API, the implementation is based on a probabilistic
structure - [Cuckoo Heavy
Keeper](https://dl.acm.org/doi/abs/10.14778/3746405.3746434) (CHK) -
with an added min-heap to keep track of the top K hotkeys' names. CHK is
loosely based on
[HeavyKeeper](https://www.usenix.org/conference/atc18/presentation/gong),
which is used in RedisBloom's TopK, but has higher throughput.

Random fixed-probability sampling is available via the `HOTKEYS START
SAMPLE <ratio>` param. Each key is sampled with probability `1/ratio`.
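As a rough illustration of these two mechanisms (the names `TopKTracker`, `record`, and `top_k` are invented for this sketch; exact per-key counters stand in for the CHK sketch, so only the top-K extraction and the 1/ratio sampling gate are shown):

```python
import heapq
import random

class TopKTracker:
    """Toy top-K hotkey tracker.

    The real feature uses a probabilistic Cuckoo Heavy Keeper sketch;
    here exact per-key counters stand in, so only the min-heap top-K
    extraction and the 1/ratio sampling gate are illustrated.
    """

    def __init__(self, k, sample_ratio=1):
        self.k = k
        self.sample_ratio = sample_ratio
        self.counts = {}

    def record(self, key, cost):
        # A key is tracked with probability 1/ratio, as with
        # `HOTKEYS START ... SAMPLE <ratio>`.
        if self.sample_ratio > 1 and random.random() >= 1 / self.sample_ratio:
            return
        self.counts[key] = self.counts.get(key, 0) + cost

    def top_k(self):
        # heapq keeps the k heaviest keys without sorting everything.
        return heapq.nlargest(self.k, self.counts.items(), key=lambda kv: kv[1])

tracker = TopKTracker(k=2)
for key, cost in [("a", 5), ("b", 1), ("a", 5), ("c", 3)]:
    tracker.record(key, cost)
assert tracker.top_k() == [("a", 10), ("c", 3)]
```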

## Performance implications

With a low enough sample rate (controlled by `HOTKEYS START SAMPLE
<ratio>`) there is a negligible performance hit. Tracking every key,
though, can incur up to a 15% hit in [the worst
case](https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-1Mkeys-string-get-10B-pipeline-500.yml)
after running the tests in this
[bench](https://github.com/redis/redis-benchmarks-specification/).

---------

Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
Co-authored-by: Slavomir Kaslev <slavomir.kaslev@gmail.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
2026-01-16 17:15:28 +02:00
Moti Cohen
11e73c66a8
Modules KeyMeta (Keys Metadata) (#14445)
Modules KeyMeta (Keys Metadata)

Redis modules often need to associate additional metadata with keys in
the keyspace. The objective is to create a unified and extensible
interface, usable by modules, Redis core, and perhaps later by users,
that facilitates the association and management of metadata with keys.
While extending RedisModuleTypes might be an easier path, this proposal
goes one step further: a general-purpose mechanism that lets modules
attach metadata to any key, independent of the underlying data type.

A major part of this feature involves defining how metadata is managed
throughout a key’s lifecycle. Modules will be able to optionally
register distinct metadata classes, each with its own lifecycle
callbacks and capable of storing an arbitrary 8-byte value per key.
These metadata values are embedded directly within Redis’s core
key-value objects to ensure fast access and automatic callback execution
as keys are created, updated, or deleted. Each 8 bytes of metadata can
represent either a simple primitive value or a pointer/handle to more
complex data managed externally by the module and serialized to RDB
along with the key.

Key Features:
- Modules can register up to 7 metadata classes (8 total, 1 reserved)
- Each class: 4-char name + 5-bit version (e.g., "SRC1" v1)
- Each class attaches 8 bytes per key (value or pointer/handle)
- Separate namespace from module data types

Module API:
- RedisModule_CreateKeyMetaClass() - Register metadata class
- RedisModule_ReleaseKeyMetaClass() - Release metadata class
- RedisModule_SetKeyMeta() - Attach/update metadata
- RedisModule_GetKeyMeta() - Retrieve metadata

Lifecycle Callbacks:
- copy, rename, move - Handle key operations
- unlink, free - Handle key deletion/expiration
- rdb_save, rdb_load - RDB persistence
- aof_rewrite - AOF rewrite support

Implementation:
- Metadata slots allocated before kvobj in reverse class ID order
- 8-bit metabits bitmap tracks active classes per key
- Minimal memory overhead - only allocated slots consume memory
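A toy model of this per-key layout, with hypothetical names, assuming only what the bullets above state: an 8-bit metabits bitmap marks which classes are active, and each active class contributes one 8-byte slot.

```python
class KeyMeta:
    """Toy model of the metabits bitmap + 8-byte slots (not Redis code)."""

    MAX_CLASSES = 8  # 8 total, 1 reserved by core

    def __init__(self):
        self.metabits = 0   # bit i set => class i has a slot on this key
        self.slots = {}     # class_id -> 8-byte value (or pointer/handle)

    def set_meta(self, class_id, value):
        assert 0 <= class_id < self.MAX_CLASSES
        assert 0 <= value < 2 ** 64   # each slot holds exactly 8 bytes
        self.metabits |= 1 << class_id
        self.slots[class_id] = value

    def get_meta(self, class_id):
        if not (self.metabits >> class_id) & 1:
            return None               # class not attached to this key
        return self.slots[class_id]

meta = KeyMeta()
meta.set_meta(3, 0xDEADBEEF)
assert meta.get_meta(3) == 0xDEADBEEF
assert meta.get_meta(1) is None      # only allocated slots consume memory
assert bin(meta.metabits) == "0b1000"
```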

RDB Serialization (v13):
- New opcode RDB_OPCODE_KEY_METADATA
- Compact 32-bit class spec: 24-bit name + 5-bit ver + 3-bit flags
- Self-contained format: [META,] TYPE, KEY, VALUE
- Portable across cluster nodes

Integration:
- Core ops: dbAdd, dbSet, COPY, MOVE, RENAME, DELETE
- DUMP/RESTORE support
- AOF rewrite via module callbacks
- Defragmentation support
- Module type I/O refactored to ModuleEntityId
2026-01-15 23:11:17 +02:00
Sergei Georgiev
221409788a
Add idempotency support to XADD via IDMPAUTO and IDMP parameters (#14615)
# Overview

This PR introduces idempotency support to Redis Streams' XADD command,
enabling automatic deduplication of duplicate message submissions
through optional IDMPAUTO and IDMP parameters with producer
identification. This enables reliable at-least-once delivery while
preventing duplicate entries in streams.

## Problem Statement

Current Redis Streams implementations lack built-in idempotency
mechanisms, making reliable at-least-once delivery impossible without
accepting duplicates:

- **Application-level tracking**: Developers must maintain separate data
structures to track submitted messages
- **Race conditions**: Network failures and retries can result in
duplicate stream entries
- **Complexity overhead**: Each producer must implement custom
deduplication logic
- **Memory inefficiency**: External deduplication systems duplicate
Redis's storage capabilities

This lack of native idempotency support creates reliability challenges
in distributed systems where at-least-once delivery semantics are
required but exactly-once processing is desired.

## Solution

Extends XADD with optional idempotency parameters that include producer
identification:

```
XADD key [NOMKSTREAM] [KEEPREF | DELREF | ACKED] [IDMPAUTO pid | IDMP pid iid] [MAXLEN | MINID [= | ~] threshold [LIMIT count]] <* | id> field value [field value ...]
```

### Producer ID (pid)

- **pid** (producer id): A unique identifier for each producer
- Must be unique per producer instance
- Producers must use the same pid after restart to access their
persisted idempotency tracking
- Enables per-producer idempotency tracking, isolating duplicate
detection between different producers

**Format**: Binary or string, recommended max 36 bytes

**Generation**: 
- **Recommended**: UUID v4 for globally unique identification
- **Alternative**: `hostname:process_id` or application-assigned IDs

### Idempotency Modes

**IDMPAUTO pid (Automatic Idempotency)**:

- Producer specifies its pid, Redis automatically calculates a unique
idempotent ID (iid) based on entry content
- Hash calculation combines XXH128 hashing of individual field-value
pairs using an order-independent Sum + XOR approach with rotation (each
pair: `XXH128(field || field_length || value)`)
- 16-byte binary iid with extremely low accidental collision probability
- XXH128 is a non-cryptographic hash function: fast and
well-distributed, but does NOT prevent intentional collision attacks
- For protection against adversarial collision crafting, use IDMP mode
with cryptographically-signed idempotent IDs
- Order-independent: field ordering does not affect the calculated iid
- If (pid, iid) pair exists in producer's IDMP map: returns existing
entry ID without creating duplicate entry
- Generally slower than manual mode due to hash calculation overhead

**IDMP pid iid (Manual Idempotency)**:

- Caller provides explicit producer id (pid) and idempotent ID (iid) for
deduplication
- iid must be unique per message (either globally or per pid)
- Faster processing than IDMPAUTO (no hash calculation overhead)
- Enables shorter iids for reduced memory footprint
- If (pid, iid) pair exists in producer's IDMP map: returns existing
entry ID without comparing field contents
- Caller responsible for iid uniqueness and consistency across retries

Both modes can only be specified when entry ID is `*` (auto-generated).
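A sketch of the IDMPAUTO idea under stated assumptions: XXH128 is not in the Python stdlib, so `blake2b` with a 16-byte digest stands in for it, and the exact Sum + XOR + rotation mix is assumed rather than taken from the Redis source; the point illustrated is only the order-independent, length-separated combination.

```python
import hashlib

MASK = (1 << 128) - 1

def _pair_hash(field: bytes, value: bytes) -> int:
    # Stand-in for XXH128(field || field_length || value); the length
    # separator prevents ambiguous concatenations like ("ab","c")/("a","bc").
    data = field + len(field).to_bytes(8, "little") + value
    return int.from_bytes(hashlib.blake2b(data, digest_size=16).digest(), "little")

def auto_iid(pairs) -> bytes:
    # Order-independent combination: a commutative fold (sum and xor)
    # over per-pair hashes, with an assumed rotate-left-by-1 mix.
    acc_sum, acc_xor = 0, 0
    for field, value in pairs:
        h = _pair_hash(field, value)
        acc_sum = (acc_sum + h) & MASK
        acc_xor ^= ((h << 1) | (h >> 127)) & MASK
    return ((acc_sum ^ acc_xor) & MASK).to_bytes(16, "little")

a = auto_iid([(b"temp", b"21"), (b"unit", b"C")])
b = auto_iid([(b"unit", b"C"), (b"temp", b"21")])  # reordered fields
assert a == b        # field order does not change the iid
assert len(a) == 16  # 16-byte binary iid
```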

### Deduplication Logic

When XADD is called with idempotency parameters:

1. Redis checks if the message was recently added to the stream based on
the (pid, iid) pair
2. If the (pid, iid) pair matches a recently-seen pair for that
producer, the message is assumed to be identical
3. No duplicate message is added to the stream; the existing entry ID is
returned
4. With **IDMP pid iid**: Redis does not compare the specified fields
and their values—two messages with the same (pid, iid) are assumed
identical
5. With **IDMPAUTO pid**: Redis calculates the iid from message content
and checks for duplicates

## IDMP Map: Per-Producer Time and Capacity-Based Expiration

Each producer with idempotency enabled maintains its own isolated IDMP
map (iid → entry_id) with dual expiration criteria:

**Time-based expiration (duration)**:

- Each iid expires automatically after duration seconds from insertion
- Provides operational guarantee: Redis will not forget an iid before
duration elapses (unless capacity reached)
- Configurable per-stream via XCFGSET

**Capacity-based expiration (maxsize)**:

- Each producer's map enforces maximum capacity of maxsize entries
- When capacity reached, oldest iids for that producer are evicted
regardless of remaining duration
- Prevents unbounded memory growth during extended usage
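The dual expiration criteria can be sketched like this (a toy model with invented names, not Redis code; timestamps are passed in explicitly to keep it deterministic):

```python
from collections import OrderedDict

class IdmpMap:
    """Toy per-producer iid -> entry_id map with time- and
    capacity-based expiration, as described above."""

    def __init__(self, duration_s=100, maxsize=100):
        self.duration_s = duration_s
        self.maxsize = maxsize
        self.entries = OrderedDict()   # iid -> (entry_id, inserted_at)

    def _expire(self, now):
        # Time-based: drop iids older than duration_s (oldest first).
        while self.entries:
            iid, (_, ts) = next(iter(self.entries.items()))
            if now - ts < self.duration_s:
                break
            self.entries.popitem(last=False)

    def add(self, iid, entry_id, now):
        """Returns the existing entry_id on a duplicate, else stores."""
        self._expire(now)
        if iid in self.entries:
            return self.entries[iid][0]       # duplicate: no new entry
        if len(self.entries) >= self.maxsize:
            self.entries.popitem(last=False)  # capacity-based eviction
        self.entries[iid] = (entry_id, now)
        return entry_id

m = IdmpMap(duration_s=100, maxsize=2)
assert m.add("iid-1", "1-1", now=0) == "1-1"
assert m.add("iid-1", "1-2", now=1) == "1-1"   # duplicate detected
assert m.add("iid-2", "1-2", now=2) == "1-2"
assert m.add("iid-3", "1-3", now=3) == "1-3"   # capacity: evicts iid-1
assert m.add("iid-1", "1-4", now=4) == "1-4"   # iid-1 was forgotten
```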

### Configuration Commands

**XINFO STREAM**: View current configuration and metrics

Use `XINFO STREAM key` to retrieve idempotency configuration
(idmp-duration, idmp-maxsize) along with tracking metrics.

**XCFGSET**: Configure expiration parameters

```
XCFGSET key [IDMP-DURATION duration] [IDMP-MAXSIZE maxsize]
```

- **duration**: Seconds to retain each iid (range: 1-86400 seconds)
- **maxsize**: Maximum iids to track per producer (range: 1-10,000
entries)
- Calling XCFGSET clears all existing producer IDMP maps for the stream

**Default Configuration** (when XCFGSET not called):

- Duration: 100 seconds
- Maxsize: 100 iids per producer
- Runtime configurable via: `stream-idmp-duration` and
`stream-idmp-maxsize`

## Response Behavior

**On first submission** (pid, iid) pair not in producer's map:

- Entry added to stream with generated entry ID
- (pid, iid) pair stored in producer's IDMP map with current timestamp
- Returns new entry ID

**On duplicate submission** (pid, iid) pair exists in producer's map:

- No entry added to stream
- Returns existing entry ID from producer's IDMP map
- Identical response to original submission (client cannot distinguish)

## Stream Metadata

XINFO STREAM extended with idempotency metrics and configuration:

- **idmp-duration**: The duration value (in seconds) configured for the
stream's IDMP map
- **idmp-maxsize**: The maxsize value configured for the stream's IDMP
map
- **pids-tracked**: Current number of producers with active IDMP maps
- **iids-tracked**: Current total number of iids across all producers'
IDMP maps (reflects active iids that haven't expired or been evicted)
- **iids-added**: Lifetime cumulative count of entries added with
idempotency parameters
- **iids-duplicates**: Lifetime cumulative count of duplicate iids
detected across all producers

## Persistence and Restart Behavior

**IDMP maps are fully persisted and restored across Redis restarts**:

- **RDB/AOF**: All pid-iid pairs, timestamps, and configuration are
included in snapshots and AOF logs
- **Recovery**: On restart, all tracked (pid, iid) pairs remain valid
and operational
- **Producer Requirement**: Producers must reuse the same pid after
restart to access their persisted IDMP map
- **Configuration**: Stream-level settings (duration, maxsize) persist
across restarts
- **Important**: Calling XCFGSET after restart clears restored IDMP maps
(same behavior as during runtime)

## Key Benefits

- **Enables At-most-once Producer Semantics**: Makes it possible to
safely retry message submissions without creating duplicates
- **Automatic Retry Safety**: Network failures and retries cannot create
duplicate entries
- **Producer Isolation**: Each producer maintains independent
idempotency tracking
- **Memory Efficient**: Time and capacity-based expiration per producer
prevents unbounded growth
- **Flexible Implementation**: Choose automatic (IDMPAUTO) or manual
(IDMP) based on performance needs
- **Backward Compatible**: Fully optional parameters with zero impact on
existing XADD behavior
- **Collision Resistant**: XXH128 with Sum + XOR combination and
field-length separators provides high-quality non-cryptographic hashing
for IDMPAUTO with extremely low collision probability and prevents
ambiguous concatenation attacks
2026-01-15 21:58:44 +08:00
Filipe Oliveira (Redis)
7e7c7b0558
Fix flaky test failures in caused by clock precision issues with monotonic clock. (#14697)
Fix flaky test failures in `tests/unit/moduleapi/blockedclient.tcl`
caused by
clock precision issues with monotonic clock.

The test runs a command that blocks for 200ms and then asserts the
elapsed time
is >= 200ms. Due to clock skew and timing precision differences, the
measured
time occasionally comes back as 199ms, causing spurious test failures.
2026-01-14 19:44:05 +08:00
Vitah Lin
e396dd3385
Fix flaky stream LRM test due to timing precision (#14674)
2026-01-09 10:14:44 +08:00
Yuan Wang
858a8800e2
Propagate migrate task info to replicas (#14672)
- Allow replicas to track the master's migrate task state
Previously we only propagated import task info to replicas, but now we
also support propagating migrate task info, so the new master can
initiate slot trimming again if needed after a failover; this avoids
data redundancy.

- Prevent replicas from initiating slot trimming actively
Since the source side lacks a data cleaning mechanism, we allowed
replicas to continue pending slot trimming, but it is not a good idea to
let replicas trim actively. As we introduce the feature above, we can
delete this logic.
2026-01-08 19:06:57 +08:00
Slavomir Kaslev
5aa47347e7
Fix CLUSTER SLOT-STATS test Lua scripts (#14671)
Fix hard-coded keys in test Lua scripts, which are incompatible with
cluster mode.

Reported-by: Oran Agra <oran@redis.com>
2026-01-08 11:16:50 +02:00
Stav-Levi
73249497d4
Fix ACL key-pattern bypass in MSETEX command (#14659)
MSETEX doesn't properly check ACL key permissions for all keys - only
the first key is validated.

MSETEX arguments look like: MSETEX <numkeys> key1 val1 key2 val2 ... EX
seconds

Keys are at every 2nd position (step=2). When Redis extracts keys for
ACL checking, it calculates where the last key is:

last = first + numkeys - 1;          /* bug: calculation ignores step */
last = first + (numkeys - 1) * step; /* fix */

With 2 keys starting at position 2:

Bug: last = 2 + 2 - 1 = 3 → only checks position 2

Fixes #14657
2026-01-08 08:41:55 +02:00
Salvatore Sanfilippo
154fdcee01
Test tcp deadlock fixes (#14667)
**Disclaimer: this patch was created with the help of AI**

My experience with the Redis test not passing on older hardware didn't
stop just with the other PR opened with the same problem. There was
another deadlock happening when the test was writing a lot of commands
without reading it back, and the cause seems related to the fact that
such tests have something in common: they create a deferred client
(which does not read replies at all unless asked to) and flood the
server with 1 million requests without reading anything back. This
results in a networking issue where the TCP socket stops accepting more
data, and the test hangs forever.

Reading those replies from time to time allows the test to run on such
older hardware.

Ping oranagra that introduced at least one of the bulk writes tests.
AFAIK there is no problem in the test, if we change it in this way,
since the slave buffer is going to be filled anyway. But better to be
sure that it was not intentional to write all those data without reading
back for some reason I can't see.

IMPORTANT NOTE: **I am NOT sure at all** that the TCP socket senses
congestion in one side and also stops the other side, but anyway this
fix works well and is likely a good idea in general. At the same time, I
doubt there is a pending bug in Redis that makes it hang if the output
buffer is too large, or we are flooding the system with too many
commands without reading anything back. So the actual cause remains
cloudy. I remember that Redis, when the output limit is reached, could
kill the client, and not lower the priority of command processing. Maybe
Oran knows more about this.

## LLM commit message.

The test "slave buffer are counted correctly" was hanging indefinitely
on slow machines. The test sends 1M pipelined commands without reading
responses, which triggers a TCP-level deadlock.

Root cause: When the test client sends commands without reading
responses:
1. Server processes commands and sends responses
2. Client's TCP receive buffer fills (client not reading)
3. Server's TCP send buffer fills
4. Packets get dropped due to buffer pressure
5. TCP congestion control interprets this as network congestion
6. cwnd (congestion window) drops to 1, RTO increases exponentially
7. After multiple backoffs, RTO reaches ~100 seconds
8. Connection becomes effectively frozen

This was confirmed by examining TCP socket state showing cwnd:1,
backoff:9, rto:102912ms, and rwnd_limited:100% on the client side.

The fix interleaves reads with writes by processing responses every
10,000 commands. This prevents TCP buffers from filling to the point
where congestion control triggers the pathological backoff behavior.

The test still validates the same functionality (slave buffer memory
accounting) since the measurement happens after all commands complete.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-07 14:26:22 +08:00
debing.sun
0cb1ee0dc1
New eviction policies - least recently modified (#14624)
### Summary

This PR introduces two new maxmemory eviction policies: `volatile-lrm`
and `allkeys-lrm`.
LRM (Least Recently Modified) is similar to LRU but only updates the
timestamp on write operations, not read operations. This makes it useful
for evicting keys that haven't been modified recently, regardless of how
frequently they are read.

### Core Implementation

The LRM implementation reuses the existing LRU infrastructure but with a
key difference in when timestamps are updated:

- **LRU**: Updates timestamp on both read and write operations
- **LRM**: Updates timestamp only on write operations via `updateLRM()`

### Key changes:
Extend `keyModified()` to accept an optional `robj *val` parameter and
call `updateLRM()` when a value is provided. Since `keyModified()`
serves as the unified entry point for all key modifications, placing the
LRM update here ensures timestamps are consistently updated across all
write operations.
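The behavioral difference can be sketched as follows (a toy model that keeps two separate timestamps for contrast; the real implementation reuses the single LRU clock field and merely changes when it is updated):

```python
class Key:
    """Toy model contrasting LRU and LRM timestamps (not Redis code)."""

    def __init__(self, now):
        self.lru = now   # refreshed on reads and writes
        self.lrm = now   # refreshed on writes only

    def read(self, now):
        self.lru = now

    def write(self, now):
        self.lru = now
        self.lrm = now

k = Key(now=0)
k.write(now=1)
k.read(now=5)    # a read refreshes LRU but not LRM
assert (k.lru, k.lrm) == (5, 1)

# Under allkeys-lrm, a key that is read constantly but never modified
# still looks "old" and stays eligible for eviction.
```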

---------

Co-authored-by: oranagra <oran@redislabs.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
2026-01-06 20:57:31 +08:00
debing.sun
9ca860be9e
Fix XTRIM/XADD with approx not deletes entries for DELREF/ACKED strategies (#14623)
This bug was introduced by #14130 and found by guybe7 

When using XTRIM/XADD with approx mode (~) and DELREF/ACKED delete
strategies, if a node was eligible for removal but couldn't be removed
directly (because consumer group references need to be checked), the
code would incorrectly break out of the loop instead of continuing to
process entries within the node. This fix allows the per-entry deletion
logic to execute for eligible nodes when using non-KEEPREF strategies.
2026-01-05 21:17:36 +08:00
debing.sun
4eda670de9
Fix infinite loop during reverse iteration due to invalid numfields of corrupted stream (#14472)
Follow https://github.com/redis/redis/pull/14423

In https://github.com/redis/redis/pull/14423,
I thought the last lpNext operation of the iterator occurred at the end
of streamIteratorGetID().
However, I overlooked the fact that after calling
`streamIteratorGetID()`, we might still use `streamIteratorGetField()`
to continue moving within the current entry.
This means that during reverse iteration, the iterator could move back
to a previous entry position.

To fix this, this PR records the current position at the beginning of
streamIteratorGetID().
When we enter it again next time, we ensure that the entry position does
not exceed the previous one; that is, during forward iteration the entry
must be greater than the last entry position, and during reverse
iteration it must be smaller than the last entry position.

Note that the fix for https://github.com/redis/redis/pull/14423 has been
replaced by this fix.
2026-01-05 21:16:53 +08:00
Stav-Levi
860b8c772a
Add TLS certificate-based automatic client authentication (#14610)
This PR implements support for automatic client authentication based on
a field in the client's TLS certificate.
We adopt ValKey’s PR: https://github.com/valkey-io/valkey/pull/1920

API Changes:

Add new configuration `tls-auth-clients-user`
  - Allowed values: `off` (default), `CN`.
  - `off` – disable TLS certificate–based auto-authentication.
  - `CN` – derive the ACL username from the Common Name (CN) field of
the client certificate.
 
New INFO stat
  - `acl_access_denied_tls_cert`
- Counts failed TLS certificate–based authentication attempts, i.e. TLS
connections where a client certificate was presented, a username was
derived from it, but no matching ACL user was found.

New ACL LOG reason
  - Reason string: `"tls-cert"`
- Emitted when a client certificate’s Common Name fails to match any
existing ACL user.


Implementation Details:

- Added getCertFieldByName() utility to extract fields from peer
certificates.

- Added autoAuthenticateClientFromCert() to handle automatic login logic
post-handshake.

- Integrated automatic authentication into the TLSAccept function after
handshake completion.

- Updated test suite (tests/integration/tls.tcl) to validate the
feature.
2025-12-25 14:07:58 +02:00
Ozan Tezcan
fde3576f88
Fix adjacent slot range behavior in ASM operations (#14637)
This PR contains a few changes for ASM:

**Bug fix:** 
- Fixes an issue in ASM when adjacent slot ranges are provided in
CLUSTER MIGRATION IMPORT command (e.g. 0-10 11-100). ASM task keeps the
original slot ranges as given, but later the source node reconstructs
the slot ranges from the config update as a single range (e.g. 0-100).
This causes asmLookupTaskBySlotRangeArray() to fail to match the task,
and the source node incorrectly marks the ASM task as failed. Although
the migration completes successfully, the source node performs a
blocking trim operation for these keys, assuming the slot ownership
changed outside of an ASM operation. With this PR, redis merges adjacent
slot ranges in a slot range array to avoid this problem.
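The merging step can be sketched as (illustrative Python, not the actual C code; `merge_adjacent` is an invented name):

```python
def merge_adjacent(ranges):
    # Merge adjacent (or overlapping) inclusive slot ranges, e.g.
    # [(0, 10), (11, 100)] -> [(0, 100)], mirroring the normalization
    # described above.
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

assert merge_adjacent([(0, 10), (11, 100)]) == [(0, 100)]
assert merge_adjacent([(11, 100), (0, 10)]) == [(0, 100)]
assert merge_adjacent([(0, 10), (20, 30)]) == [(0, 10), (20, 30)]
```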
 
 **Other improvements:**
- Indicate the imported/migrated key count in the log once the ASM
operation is completed.
 - Use error return value instead of assert in parseSlotRangesOrReply()
- Validate slot range array that is given by cluster implementation on
ASM_EVENT_IMPORT_START.

---------

Co-authored-by: Yuan Wang <yuan.wang@redis.com>
2025-12-23 11:54:12 +03:00
Yuan Wang
33391a7b61
Support delay trimming slots after finishing migrating slots (#14567)
This PR introduces a mechanism that allows a module to temporarily
disable trimming after an ASM migration operation so it can safely
finish ongoing asynchronous jobs that depend on keys in migrating (and
about to be trimmed) slots.

1. **ClusterDisableTrim/ClusterEnableTrim**
We introduce the `ClusterDisableTrim`/`ClusterEnableTrim` Module APIs to
allow a module to disable/enable slot trimming
    ```
    /* Disable automatic slot trimming. */
    int RM_ClusterDisableTrim(RedisModuleCtx *ctx)

    /* Enable automatic slot trimming */
    int RM_ClusterEnableTrim(RedisModuleCtx *ctx)
    ```

**Please notice**: Redis will not start any subsequent import or migrate
ASM operations while slot trimming is disabled, so modules must
re-enable trimming immediately after completing their pending work.

The only valid and meaningful time for a module to disable trimming
appears to be after the MIGRATE_COMPLETED event.

2. **REDISMODULE_OPEN_KEY_ACCESS_TRIMMED**
Added REDISMODULE_OPEN_KEY_ACCESS_TRIMMED to RM_OpenKey() so that a
module can operate on keys in the unowned slots after trimming is
paused.

Now we don't delete a key that is part of a trim job when we access it.
`expireIfNeeded` returns `KEY_VALID` if `EXPIRE_ALLOW_ACCESS_TRIMMED` is
set; otherwise, it returns `KEY_TRIMMED` without deleting the key.

3. **REDISMODULE_CTX_FLAGS_TRIM_IN_PROGRESS**
We also extend RM_GetContextFlags() to include a flag
REDISMODULE_CTX_FLAGS_TRIM_IN_PROGRESS indicating whether a trimming job
is pending (due to trim pause) or in progress. Modules could
periodically poll this flag to synchronize their internal state, e.g.,
if a trim job was delayed or if the module incorrectly assumed trimming
was still active.

Bugfix: RM_SetClusterFlags could not clear a flag after enabling it first.

---------

Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
2025-12-16 16:30:56 +08:00
Yuan Wang
f3316c3a1a
Introduce flushdb option for repl-diskless-load (#14596)
The `repl-diskless-load` feature can effectively reduce full
synchronization time, but it is perhaps not widely used: the `swapdb`
option requires double `maxmemory`, and `on-empty-db` only works for the
first full sync (the replica must have no data).

This PR introduces a new option: `flushdb` - always flush the entire
dataset before the diskless load. If the diskless load fails, the replica
will lose all existing data.

Of course, this brings a risk of data loss, but it provides a choice for
users who want to reduce full sync time and accept that risk.
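Assuming the option value named in this PR, a replica opting into the new mode might be configured like this (a sketch, not verified config syntax):

```
# On the replica: accept the data-loss risk in exchange for a faster full sync.
repl-diskless-load flushdb
```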
2025-12-15 11:25:53 +08:00
Stav-Levi
23aca15c8c
Fix the flexibility of argument positions in the Redis API's (#14416)
This PR implements flexible keyword-based argument parsing for all 12
hash field expiration commands, allowing users to specify arguments in
any logical order rather than being constrained by rigid positional
requirements.
This enhancement follows Redis's modern design of keyword-based flexible
argument ordering and significantly improves user experience.

Commands with flexible parsing:
HEXPIRE, HPEXPIRE, HEXPIREAT, HPEXPIREAT, HGETEX, HSETEX

Some examples:

HEXPIRE:
* All these are equivalent and valid:
HEXPIRE key EX 60 NX FIELDS 2 f1 f2
HEXPIRE key NX EX 60 FIELDS 2 f1 f2  
HEXPIRE key FIELDS 2 f1 f2 EX 60 NX
HEXPIRE key FIELDS 2 f1 f2 NX EX 60
HEXPIRE key NX FIELDS 2 f1 f2 EX 60

HGETEX:
* All these are equivalent and valid:
HGETEX key EX 60 FIELDS 2 f1 f2
HGETEX key FIELDS 2 f1 f2 EX 60

HSETEX:
* All these are equivalent and valid:
HSETEX key FNX EX 60 FIELDS 2 f1 v1 f2 v2
HSETEX key EX 60 FNX FIELDS 2 f1 v1 f2 v2
HSETEX key FIELDS 2 f1 v1 f2 v2 FNX EX 60
HSETEX key FIELDS 2 f1 v1 f2 v2 EX 60 FNX
HSETEX key FNX FIELDS 2 f1 v1 f2 v2 EX 60
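The parsing style above can be sketched as a small keyword-driven loop; this is a hypothetical simplification (only EX, NX, and FIELDS, for a HEXPIRE-like command), not the actual Redis parser:

```python
# Order-insensitive keyword parsing: scan tokens, dispatch on the keyword,
# and consume however many arguments that keyword takes.
def parse_hexpire(args):
    opts, i = {}, 0
    while i < len(args):
        kw = args[i].upper()
        if kw == "EX":
            opts["ex"] = int(args[i + 1]); i += 2
        elif kw == "NX":
            opts["nx"] = True; i += 1
        elif kw == "FIELDS":
            n = int(args[i + 1])
            opts["fields"] = args[i + 2 : i + 2 + n]; i += 2 + n
        else:
            raise ValueError(f"syntax error near {args[i]}")
    return opts

# Different orderings parse to the same result:
a = parse_hexpire(["EX", "60", "NX", "FIELDS", "2", "f1", "f2"])
b = parse_hexpire(["FIELDS", "2", "f1", "f2", "NX", "EX", "60"])
assert a == b == {"ex": 60, "nx": True, "fields": ["f1", "f2"]}
```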
2025-12-14 09:35:12 +02:00
debing.sun
679e009b73
Add daily CI for vectorset (#14302)
2025-12-10 08:52:43 +08:00
Slavomir Kaslev
5299ccf2a9
Add kvstore type and decouple kvstore from its metadata (#14543)
Decouple kvstore from its metadata by introducing a `kvstoreType` structure
of callbacks. This resolves the abstraction-layer violation of having kvstore
include `server.h` directly.

Move (again) the cluster slot statistics to the per-slot dicts' metadata. The
`canFreeDict` callback prevents empty per-slot dicts from being freed when
doing so would lose per-slot statistics.
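The struct-of-callbacks decoupling can be sketched like this; the names are illustrative stand-ins, not the actual Redis definitions:

```python
# The kvstore only sees a kvstoreType of callbacks, never server internals.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class KvstoreType:
    # Consulted before freeing an empty per-slot dict; returning False keeps
    # the dict alive (e.g. to preserve per-slot statistics in its metadata).
    can_free_dict: Callable[[int], bool]

@dataclass
class Kvstore:
    ktype: KvstoreType
    dicts: dict = field(default_factory=dict)

    def try_free_empty_dict(self, slot):
        if not self.dicts.get(slot) and self.ktype.can_free_dict(slot):
            self.dicts.pop(slot, None)
            return True
        return False

slots_with_stats = {5}
kv = Kvstore(KvstoreType(can_free_dict=lambda s: s not in slots_with_stats))
kv.dicts = {5: {}, 9: {}}
assert kv.try_free_empty_dict(9) is True    # no stats: safe to free
assert kv.try_free_empty_dict(5) is False   # stats present: kept alive
```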

Co-authored-by: Ran Tidhar <ran.tidhar@redis.com>
2025-12-08 21:12:33 +02:00
Yuan Wang
cb71dec0c3
Disable RDB compression when diskless replication is used (#14575)
Fixes #14538

If the master uses diskless synchronization and the replica uses
diskless load, we can disable RDB compression to reduce full sync time.
I tested on AWS and found we could reduce time by 20-40%.

In terms of implementation, when the replica can use diskless load, it
sends `replconf rdb-no-compress 1` to the master so that the master
delivers an RDB without compression.

If your network is slow, please disable repl-diskless-load, and maybe even
repl-diskless-sync.

---------

Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
2025-12-04 09:24:23 +08:00
Ozan Tezcan
08b63b6ceb
Fix flaky ASM tests (#14604)
1. Fix "Simple slot migration with write load" by introducing an artificial
delay in the traffic generator to slow it down for TSAN builds. Failed test:
https://github.com/redis/redis/actions/runs/19720942981/job/56503213650

2. Fix "Test RM_ClusterCanAccessKeysInSlot returns false for unowned
slots" by waiting for config propagation before checking it on a replica.
Failed test:
https://github.com/redis/redis/actions/runs/19841852142/job/56851802772
2025-12-03 12:12:48 +03:00
Ozan Tezcan
3c57a8fc92
Retry an ASM import step when the source node is temporarily not ready (#14599)
The cluster implementation may be temporarily unavailable and return an
error to the `ASM_EVENT_MIGRATE_PREP` event to prevent starting a new
migration. Although this is most likely a transient condition, the
source node has no way to distinguish it from a real error, so it must
fail the import attempt and start a new one.

In Redis, failing an attempt is cheap, but in other cluster
implementations it may require cleaning up resources and can cause
unnecessary disruption.

This PR introduces a new `-NOTREADY` error reply for the `CLUSTER
SYNCSLOTS SYNC` command. When the source replies with `-NOTREADY`, the
destination can recognize the condition as transient and periodically
retry sending the `CLUSTER SYNCSLOTS SYNC` step instead of failing the
attempt.
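The destination-side retry behavior can be sketched as follows; `send_syncslots_sync` is a hypothetical transport stub, and the retry/backoff policy is illustrative:

```python
# Retry CLUSTER SYNCSLOTS SYNC while the source replies -NOTREADY; only a
# real error (or exhausting retries) fails the import attempt.
def run_import_step(send_syncslots_sync, max_retries=5):
    for _ in range(max_retries):
        reply = send_syncslots_sync()
        if reply == "-NOTREADY":
            continue          # transient: retry instead of failing the attempt
        if reply.startswith("-"):
            raise RuntimeError(f"import failed: {reply}")
        return reply          # source accepted the sync step
    raise RuntimeError("import failed: source stayed NOTREADY")

replies = iter(["-NOTREADY", "-NOTREADY", "+OK"])
assert run_import_step(lambda: next(replies)) == "+OK"
```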
2025-12-02 13:38:22 +03:00
Oran Agra
82fbf213eb
fix test tag leakage that can result in skipping tests (#14572)
some error handling paths didn't remove the tags they added, but most
importantly, if the start_server proc is given the "tags" argument more
than once, on exit, it only removed the last one.

this problem exists in start_cluster in list.tcl, and the result was
that the "external:skip cluster modules" were not removed
2025-11-26 09:13:21 +02:00
RoyBenMoshe
39200596f4
SCAN: restore original filter order (#14537)
In #14121, the order of the SCAN filters was changed: before #14121 the
order was pattern, expiration, and type; after #14121, the pattern filter
became last. This breaking change altered the original behavior, causing
SCAN with a pattern to also remove the expired keys.
This PR reorders the filters to be consistent with the original behavior
and extends a test to cover this scenario.
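Why the order matters can be sketched like this (a simplified model, not the actual SCAN implementation): the expiration check deletes expired keys as a side effect, so running the pattern filter first keeps non-matching keys untouched.

```python
# Pattern filter first, then the expiration check (which deletes as it goes).
import fnmatch

def scan_filter(db, pattern, now):
    out = []
    for key, (value, expire_at) in list(db.items()):
        if not fnmatch.fnmatch(key, pattern):
            continue                      # pattern first: key left untouched
        if expire_at is not None and expire_at <= now:
            del db[key]                   # expiration check deletes the key
            continue
        out.append(key)
    return out

db = {"user:1": ("a", None), "tmp:1": ("b", 0)}   # tmp:1 already expired
assert scan_filter(db, "user:*", now=100) == ["user:1"]
assert "tmp:1" in db   # expired but not matched, so not deleted
```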
2025-11-25 15:30:43 +08:00
lihp
0288d70820
Fixes an issue where EXEC checks ACL during AOF loading (#14545)
This PR fixes an issue (#14541) where EXEC's ACL recheck was still being
performed during AOF loading, which could cause AOF loading to fail if the
ACL rules were changed and no longer allow some commands in a MULTI/EXEC
block.
2025-11-22 11:52:31 +08:00
debing.sun
bb6389e823
Fix min_cgroup_last_id cache not updated when destroying consumer group (#14552)
## Problem

When destroying a consumer group with `XGROUP DESTROY`, the cached
`min_cgroup_last_id` was not being invalidated. This caused incorrect
behavior when using `XDELEX` with the `ACKED` option, as the cache still
referenced the destroyed group's `last_id`.

## Solution

Invalidate the `min_cgroup_last_id` cache when the destroyed group's
`last_id` equals the cached minimum. The cache will be recalculated on
the next call to `streamEntryIsReferenced()`.
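The invalidation rule can be sketched like this; the stream structure is a hypothetical simplification, with only the names mirroring the text:

```python
# Cached minimum of consumer-group last_ids; invalidated (set to None) when
# the destroyed group held the cached minimum, and recalculated lazily.
class Stream:
    def __init__(self):
        self.groups = {}                 # group name -> last_id
        self.min_cgroup_last_id = None   # cached minimum; None == invalidated

    def min_last_id(self):
        if self.min_cgroup_last_id is None and self.groups:
            self.min_cgroup_last_id = min(self.groups.values())
        return self.min_cgroup_last_id

    def destroy_group(self, name):
        last_id = self.groups.pop(name)
        # Invalidate when the destroyed group held the cached minimum, so the
        # next lookup recalculates it instead of using a stale value.
        if last_id == self.min_cgroup_last_id:
            self.min_cgroup_last_id = None

s = Stream()
s.groups = {"g1": 5, "g2": 9}
assert s.min_last_id() == 5
s.destroy_group("g1")
assert s.min_last_id() == 9   # recalculated, not the stale 5
```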

---------

Co-authored-by: guybe7 <guy.benoish@redislabs.com>
2025-11-21 22:37:17 +08:00
Ozan Tezcan
b632e9df6a
Fix flaky ASM write load test (#14551)
Extend write pause timeout to stabilize ASM write load test under TSAN.

Failing test for reference:
https://github.com/redis/redis/actions/runs/19520561209/job/55882882951
2025-11-21 12:18:28 +03:00
Yuan Wang
7a3cb3b4b3
Fix CI flaky tests (#14531)
- https://github.com/redis/redis/actions/runs/19200504999/job/54887625884
   avoid calling `start_write_load` before pausing the destination node

- https://github.com/redis/redis/actions/runs/18958533020/job/54140746904
the replica may not have synced with the master yet, so it did not update the counter
2025-11-19 17:10:57 +08:00
Mincho Paskalev
837b14c89a
Fix ASan Daily (#14527)
After https://github.com/redis/redis/pull/14226 module tests started
running with ASan enabled.

`auth.c` blocks the user on auth and spawns a thread that sleeps for
0.5s before unblocking the client and returning.

A Tcl test unloads the module, which may happen just after the spawned
thread unblocks the client. In that case, if the unloading finishes fast
enough, the spawned thread may try to execute code from the module's
already-unloaded dynamic library, resulting in a segfault.

Fix: just wait for the thread in the module's OnUnload method.
2025-11-19 10:56:18 +02:00
Oran Agra
0a6eacff1f
Add variable key-spec flags to SET IF* and DELEX (#14529)
These commands behave like DEL and SET (blindly remove or overwrite) when
they are given no IF* flags, and require access to the key's value when
they do run with these flags.

This makes sure they have the VARIABLE_FLAGS flag, and a getKeysProc that
can provide the right flags depending on the arguments used (the plain
flags, used when the arguments are unknown, are the common-denominator
ones).

Move the lookupKey call in DELEX to avoid a double lookup, which also
means some syntax errors (namely arity) are checked and reported before
checking the existence of the key.
2025-11-12 11:36:10 +02:00
Sergei Georgiev
90ba7ba4dc
Fix XREADGROUP CLAIM to return delivery metadata as integers (#14524)
### Problem
The XREADGROUP command with CLAIM parameter incorrectly returns delivery
metadata (idle time and delivery count) as strings instead of integers,
contradicting the Redis specification.

### Solution
Updated the XREADGROUP CLAIM implementation to return delivery metadata
fields as integers, aligning with the documented specification and
maintaining consistency with Redis response conventions.

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
2025-11-11 19:05:22 +08:00