Add Atomic Slot Migration (ASM) support (#14414)

## <a name="overview"></a> Overview 
This PR is a joint effort with @ShooterIT. I’m just opening it on behalf of both of us.

This PR introduces Atomic Slot Migration (ASM) for Redis Cluster — a new
mechanism for safely and efficiently migrating hash slots between nodes.

Redis Cluster distributes data across nodes using 16384 hash slots, each
owned by a specific node. Sometimes slots need to be moved — for
example, to rebalance after adding or removing nodes, or to mitigate a
hot shard that’s overloaded. Before ASM, slot migration was non-atomic
and client-dependent, relying on CLUSTER SETSLOT, GETKEYSINSLOT, MIGRATE
commands, and client-side handling of ASK/ASKING replies. This process
was complex, error-prone, and slow, and it could leave clusters in inconsistent
states after failures. Clients had to implement redirect logic,
multi-key commands could fail mid-migration, and errors often resulted
in orphaned keys or required manual cleanup. Several related discussions
can be found in the issue list; some examples:
https://github.com/redis/redis/issues/14300,
https://github.com/redis/redis/issues/4937,
https://github.com/redis/redis/issues/10370,
https://github.com/redis/redis/issues/4333,
https://github.com/redis/redis/issues/13122,
https://github.com/redis/redis/issues/11312

Atomic Slot Migration (ASM) makes slot rebalancing safe, transparent,
and reliable, addressing many of the limitations of the legacy migration
method. Instead of moving keys one by one, ASM replicates the entire
slot’s data plus live updates to the target node, then performs a single
atomic handoff. Clients keep working without handling ASK/ASKING
replies, multi-key operations remain consistent, failures don’t leave
partial states, and replicas stay in sync. The migration process also
completes significantly faster. Operators gain new commands (CLUSTER
MIGRATION IMPORT, STATUS, CANCEL) for monitoring and control, while
modules can hook into migration events for deeper integration.

### The problems of the legacy method in detail

Operators and developers ran into multiple issues with the legacy
method. Here are some of them in detail:

1. **Redirects and Client Complexity:** While a slot was being migrated,
some keys were already moved while others were not. Clients had to
handle `-ASK` redirects, send `ASKING`, and reissue requests to the target
node. Not all client libraries implemented this correctly, leading to
failed commands or subtle bugs. Even when implemented, it increased
latency and broke naive pipelines.
2. **Multi-Key Operations Became Unreliable:** Commands like `MGET key1
key2` could fail with `TRYAGAIN` if part of the slot was already
migrated. This made application logic unpredictable during resharding.
3. **Risk of failure:** Keys were moved one-by-one (with MIGRATE
command). If the source crashed, or the destination ran out of memory,
the system could be left in an inconsistent state: some keys moved,
others lost, slots partially migrated. Manual intervention was often
needed, sometimes resulting in data loss.
4. **Replica and Failover Issues:** Replicas weren’t aware of migrations
in progress. If a failover occurred mid-migration, manual intervention
was required to clean up or resume the process safely.
5. **Operational Overhead:** Operators had to coordinate multiple
commands (CLUSTER SETSLOT, MIGRATE, GETKEYSINSLOT, etc.) with little
visibility into progress or errors, making rebalancing slow and
error-prone.
6. **Poor performance:** Key-by-key migration was inherently slow and
inefficient for large slot ranges.
7. **Large keys:** Large keys could fail to migrate or cause latency
spikes on the destination node.

### How Atomic Slot Migration Fixes This

Atomic Slot Migration (ASM) eliminates all of these issues by:

1. **Client transparency:** Clients no longer need to handle ASK/ASKING; the
migration is fully transparent.
2. **Atomic ownership transfer:** The entire slot’s data (snapshot +
live updates) is replicated and handed off in a single atomic step.
3. **Performance**: ASM completes migrations significantly faster by
streaming slot data in parallel (snapshot + incremental updates) and
eliminating key-by-key operations.
4. **Consistency guarantees:** Multi-key operations and pipelines
continue to work reliably throughout migration.
5. **Resilience:** Failures no longer leave orphaned keys or partial
states; migration tasks can be retried or safely cancelled.
6. **Replica awareness:** Replicas remain consistent during migration,
and failovers will no longer leave partially imported keys.
7. **Operator visibility:** New CLUSTER MIGRATION subcommands (IMPORT,
STATUS, CANCEL) provide clear observability and management for
operators.


### ASM Diagram and Migration Steps

```
      ┌─────────────┐               ┌────────────┐     ┌───────────┐      ┌───────────┐ ┌───────┐        
      │             │               │Destination │     │Destination│      │ Source    │ │Source │        
      │  Operator   │               │   master   │     │ replica   │      │ master    │ │ Fork  │        
      │             │               │            │     │           │      │           │ │       │        
      └──────┬──────┘               └─────┬──────┘     └─────┬─────┘      └─────┬─────┘ └───┬───┘        
             │                            │                  │                  │           │            
             │                            │                  │                  │           │            
             │CLUSTER MIGRATION IMPORT    │                  │                  │           │            
             │   <start-slot> <end-slot>..│                  │                  │           │            
             ├───────────────────────────►│                  │                  │           │            
             │                            │                  │                  │           │            
             │   Reply with <task-id>     │                  │                  │           │            
             │◄───────────────────────────┤                  │                  │           │            
             │                            │                  │                  │           │            
             │                            │                  │                  │           │            
             │                            │ CLUSTER SYNCSLOTS│SYNC              │           │            
             │ CLUSTER MIGRATION STATUS   │   <task-id> <start-slot> <end-slot>.│           │            
Monitor      │   ID <task-id>             ├────────────────────────────────────►│           │            
task      ┌─►├───────────────────────────►│                  │                  │           │            
state     │  │                            │                  │                  │           │            
till      │  │      Reply status          │  Negotiation with multiple channels │           │            
completed └─ │◄───────────────────────────┤      (i.e rdbchannel repl)          │           │            
             │                            │◄───────────────────────────────────►│           │            
             │                            │                  │                  │  Fork     │            
             │                            │                  │                  ├──────────►│ ─┐         
                                          │                  │                  │           │  │         
                                          │   Slot snapshot as RESTORE commands │           │  │         
                                          │◄────────────────────────────────────────────────┤  │         
                                          │   Propagate      │                  │           │  │         
      ┌─────────────┐                     ├─────────────────►│                  │           │  │         
      │             │                     │                  │                  │           │  │ Snapshot
      │   Client    │                     │                  │                  │           │  │ delivery
      │             │                     │   Replication stream for slot range │           │  │ duration
      └──────┬──────┘                     │◄────────────────────────────────────┤           │  │         
             │                            │   Propagate      │                  │           │  │         
             │                            ├─────────────────►│                  │           │  │         
             │                            │                  │                  │           │  │         
             │    SET key value1          │                  │                  │           │  │         
             ├─────────────────────────────────────────────────────────────────►│           │  │         
             │         +OK                │                  │                  │           │ ─┘         
             │◄─────────────────────────────────────────────────────────────────┤           │            
             │                            │                  │                  │           │            
             │                            │    Drain repl stream                │ ──┐       │            
             │                            │◄────────────────────────────────────┤   │       │            
             │    SET key value2          │                  │                  │   │       │            
             ├─────────────────────────────────────────────────────────────────►│   │Write  │            
             │                            │                  │                  │   │pause  │            
             │                            │                  │                  │   │       │            
             │                            │  Publish new config via cluster bus │   │       │            
             │       +MOVED               ├────────────────────────────────────►│ ──┘       │            
             │◄─────────────────────────────────────────────────────────────────┤ ──┐       │            
             │                            │                  │                  │   │       │            
             │                            │                  │                  │   │Trim   │            
             │                            │                  │                  │ ──┘       │            
             │     SET key value2         │                  │                  │           │            
             ├───────────────────────────►│                  │                  │           │            
             │         +OK                │                  │                  │           │            
             │◄───────────────────────────┤                  │                  │           │            
             │                            │                  │                  │           │            
             │                            │                  │                  │           │            
 ```

### New commands introduced

There are two new commands: 
1. A command to start, monitor and cancel the migration operation:  `CLUSTER MIGRATION <arg>`
2. An internal command to manage slot transfer between source and destination: `CLUSTER SYNCSLOTS <arg>`

For more details, please refer to the [New Commands](#new-commands) section. Internal command messaging is mostly omitted in the diagram for simplicity.
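
For illustration, a typical session on the destination master might look like the following; the address and task ID are only examples, and the full STATUS reply layout is shown in the [New Commands](#new-commands) section. CANCEL applies only while the task is still running:

```
127.0.0.1:5001> cluster migration import 0 1000
"24cf41718b20f7f05901743dffc40bc9b15db339"
127.0.0.1:5001> cluster migration status id 24cf41718b20f7f05901743dffc40bc9b15db339
... same field layout as the sample output in the New Commands section; poll until "state" is "completed"
127.0.0.1:5001> cluster migration cancel id 24cf41718b20f7f05901743dffc40bc9b15db339
(integer) 1
```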


### Steps
1. Slot migration begins when the operator sends `CLUSTER MIGRATION IMPORT <start-slot> <end-slot> ...`
to the destination master. The process is initiated from the destination node, similar to REPLICAOF. This approach allows us to reuse the same logic and share code with the new replication mechanism (see https://github.com/redis/redis/pull/13732). The command can include multiple slot ranges. The destination node creates one migration task per source node, regardless of how many slot ranges are specified. Upon successfully creating the task, the destination node replies to the IMPORT command with the assigned task ID. The operator can then monitor progress using `CLUSTER MIGRATION STATUS ID <task-id>`. When the task’s state field changes to `completed`, the migration has finished successfully. See the [New Commands](#new-commands) section for a sample of the output.
2. After creating the migration task, the destination node requests replication of the slots using the internal command `CLUSTER SYNCSLOTS`.
3. Once the source node accepts the request, the destination node establishes another, separate connection (similar to rdbchannel replication) so snapshot data and incremental changes can be transmitted in parallel.
4. The source node forks and starts delivering snapshot content (as per-key RESTORE commands) over one connection and incremental changes over the other. The destination master starts applying commands from the snapshot connection and accumulates the incremental changes. Applied commands are also propagated to the destination replicas via the replication backlog.

    Note: Only commands of related slots are delivered to the destination node. This is done by writing them to the migration client’s output buffer, which serves as the replication stream for the migration operation.
5. Once the source node finishes delivering the snapshot and determines that the destination node has caught up (the remaining replication stream to consume has dropped below a configured limit; see the sketch after this list), it pauses write traffic for the entire server. After pausing writes, the source node forwards any remaining write commands to the destination node.

6. Once the destination consumes all the writes, it bumps the cluster config epoch and changes the configuration. The new config is published via the cluster bus.
7. When the source node receives the new configuration, it starts redirecting clients, begins trimming the migrated slots, and resumes write traffic on the server.
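
The handoff trigger in step 5 boils down to a lag check against `cluster-slot-migration-handoff-max-lag-bytes`. The following is only an illustrative sketch of that condition, not the actual source code, and the names are made up:

```c
/* Illustrative sketch: the source decides to pause writes and start the
 * handoff once the destination's unconsumed part of the slot replication
 * stream is small enough. Function and variable names are hypothetical. */
static int handoffShouldStart(long long bytes_sent_to_dest,
                              long long bytes_acked_by_dest,
                              long long handoff_max_lag_bytes) {
    long long lag = bytes_sent_to_dest - bytes_acked_by_dest;
    /* handoff_max_lag_bytes maps to cluster-slot-migration-handoff-max-lag-bytes */
    return lag <= handoff_max_lag_bytes;
}
```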

### Internal slots synchronization state machine
![asm state machine](https://github.com/user-attachments/assets/b7db353c-969e-4bde-b77f-c6abe5aa13d3)

1. The destination node performs authentication using the cluster secret introduced in #13763 , and transmits its node ID information.
2. The destination node sends `CLUSTER SYNCSLOTS SYNC <task-id> <start-slot> <end-slot>` to initiate a slot synchronization request and establish the main channel. The source node responds with `+RDBCHANNELSYNCSLOTS`, indicating that the destination node should establish an RDB channel.
3. The destination node then sends `CLUSTER SYNCSLOTS RDBCHANNEL <task-id>` to establish the RDB channel, using the same task-id as in the previous step to associate the two connections as part of the same ASM task.
The source node replies with `+SLOTSSNAPSHOT` and forks a child process to transfer the slot snapshot.
4. The destination node applies the slot snapshot data received over the RDB channel, while proxying the command stream to replicas. At the same time, the main channel continues to read and buffer incremental commands in memory.
5. Once the source node finishes sending the slot snapshot, it notifies the destination node using the `CLUSTER SYNCSLOTS SNAPSHOT-EOF` command. The destination node then starts streaming the buffered commands while continuing to read and buffer incremental commands sent from the source.
6. The destination node periodically sends `CLUSTER SYNCSLOTS ACK <offset>` to inform the source of the applied data offset. When the offset gap meets the threshold, the source node pauses write operations. After all buffered data has been drained, it sends `CLUSTER SYNCSLOTS STREAM-EOF` to the destination node to hand off slots.
7. Finally, the destination node takes over slot ownership, updates the slot configuration and bumps the epoch, then broadcasts the updates via cluster bus. Once the source node detects the updated slot configuration, the slot migration process is complete. 
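
Put together, the exchange between the two nodes looks roughly like this. This is a simplified view: the task ID and slot range are illustrative, and the periodic ACKs repeat throughout the process:

```
dest -> src : CLUSTER SYNCSLOTS SYNC <task-id> 0 1000             (establishes the main channel)
src  -> dest: +RDBCHANNELSYNCSLOTS
dest -> src : CLUSTER SYNCSLOTS RDBCHANNEL <task-id>              (establishes the RDB channel)
src  -> dest: +SLOTSSNAPSHOT, then the slot snapshot as RESTORE/SET commands
src  -> dest: incremental commands for the migrating slots        (buffered on the main channel)
src  -> dest: CLUSTER SYNCSLOTS SNAPSHOT-EOF                      (snapshot delivery finished)
dest -> src : CLUSTER SYNCSLOTS ACK <offset>                      (periodic)
src  -> dest: CLUSTER SYNCSLOTS STREAM-EOF                        (after the write pause and drain)
```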

### Error handling
- If the connection between the source and destination is lost (due to disconnection, output buffer overflow, OOM, or timeout), the destination node automatically restarts the migration from the beginning. The destination node will retry the operation until it is explicitly cancelled using the `CLUSTER MIGRATION CANCEL ID <task-id>` command.
- If a replica connection drops during migration, it can later resume with PSYNC, since the imported slot data is also written to the replication backlog.
- During the write pause phase, the source node sets a timeout. If the destination node fails to drain remaining replication data and update the config during that time, the source node assumes the destination has failed and automatically resumes normal writes for the migrating slots.
- On any error, the destination node triggers a trim operation to discard any partially imported slot data.
- If the node crashes during importing, unowned keys are deleted on startup.


### <a name="slot-snapshot-format-considerations"></a> Slot Snapshot Format Considerations 

When the source node forks to deliver slot content, in theory, there are several possible formats for transmitting the snapshot data:

- **Mini RDB:** A compact RDB file containing only the keys from the migrating slots. This format is efficient for transmission, but it cannot be easily forwarded to destination-side replicas.
- **AOF format:** The source node can generate commands in AOF form (e.g., SET x y, HSET h f v) and stream them. Individual commands are easily appended to the replication stream and propagated to replicas. Large keys can also be split into multiple commands (incrementally reconstructing the value), similar to the AOF rewrite process.
- **RESTORE commands:** Each key is serialized and sent as a `RESTORE` command. These can be appended directly to the destination’s replication stream, though very large keys may make serialization and transmission less efficient.

We chose the `RESTORE` command as the default approach for the following reasons:
- It can be easily propagated to replicas.
- It is more efficient than AOF for most cases, and some module keys do not support the AOF format.
- For large **non-module** keys that are not string, ASM automatically switches to the AOF-based key encoding as an optimization when the key’s cardinality exceeds 512. This approach allows the key to be transferred in chunks rather than as a single large payload, reducing memory pressure and improving migration efficiency. In future versions, the RESTORE command may be enhanced to handle large keys more efficiently.

Some details:
- For RESTORE commands, Redis normally compresses the serialized payload by default. We disable compression while delivering RESTORE commands because compression comes with a performance hit; without it, replication is several times faster.
- For string keys, we still prefer the AOF format (i.e., SET commands), as it is currently more efficient than RESTORE, especially for big keys.
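
The sketch below illustrates the encoding choice described above; the helper and enum names are hypothetical and the logic is a simplification of the actual implementation:

```c
/* Illustrative only: pick the snapshot encoding for one key, following the
 * rules described above. Names are hypothetical. */
typedef enum { ENC_AOF_SET, ENC_AOF_CHUNKED, ENC_RESTORE } slotSnapshotEncoding;

static slotSnapshotEncoding chooseKeyEncoding(int is_string, int is_module_type,
                                              unsigned long cardinality) {
    if (is_string) return ENC_AOF_SET;    /* a plain SET is cheaper than RESTORE */
    if (!is_module_type && cardinality > 512)
        return ENC_AOF_CHUNKED;           /* rebuild large keys in chunks, AOF style */
    return ENC_RESTORE;                   /* default: RESTORE without compression */
}
```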

### <a name="trimming-the-keys"></a> Trimming the keys 

When a migration completes successfully, the source node deletes the migrated keys from its local database.
Since the migrated slots may contain a large number of keys, this trimming process must be efficient and non-blocking.

In cluster mode, Redis maintains per-slot data structures for keys, expires, and subexpires. This organization makes it possible to efficiently detach all data associated with a given slot in a single step. During trimming, these slot-specific data structures are handed off to a background I/O (BIO) thread for asynchronous cleanup—similar to how FLUSHALL or FLUSHDB operate. This mechanism is referred to as background trimming, and it is the preferred and default method for ASM, ensuring that the main thread remains unblocked.

However, unlike Redis itself, some modules may not maintain per-slot data structures and therefore cannot drop the related slot data in a single operation. To support these cases, Redis introduces active trimming, where key deletion occurs in the main thread instead. This is not a blocking operation; trimming proceeds incrementally in the main thread, periodically removing keys during the cron loop. Each deletion triggers a keyspace notification so that modules can react to individual key removals. While active trim is less efficient, it ensures backward compatibility for modules during the transition period.

Before starting the trim, Redis checks whether any module is subscribed to the newly added `REDISMODULE_NOTIFY_KEY_TRIMMED` keyspace event. If such subscribers exist, active trimming is used; otherwise, background trimming is triggered. Going forward, modules are expected to adopt background trimming to take advantage of its performance and scalability benefits, and active trimming will be phased out once modules migrate to the new model.

Redis also prefers active trimming if any client is using the client tracking feature (see [client-side caching](https://redis.io/docs/latest/develop/reference/client-side-caching/)). In the current client tracking protocol, when a database is flushed (e.g., via the FLUSHDB command), a null value is sent to tracking clients to indicate that they should invalidate all locally cached keys. However, there is currently no mechanism to signal that only specific slots have been flushed, and iterating over all keys in the slots to be trimmed would be a blocking operation. To avoid this, Redis automatically switches to active trimming mode whenever a client is using tracking. In the future, the client tracking protocol can be extended to support slot-based invalidation, allowing background trimming to be used in these cases as well.
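
The decision between the two trim methods can be summarized with a small sketch; the helper names are hypothetical and the real checks live in the server code:

```c
/* Illustrative only: choose the trim method for migrated slots. */
typedef enum { TRIM_BACKGROUND, TRIM_ACTIVE } trimMethod;

static trimMethod chooseTrimMethod(int key_trimmed_subscribers, int tracking_clients) {
    /* Modules subscribed to REDISMODULE_NOTIFY_KEY_TRIMMED need per-key
     * notifications, and client tracking has no slot-level invalidation yet,
     * so both cases fall back to active (main-thread, incremental) trimming. */
    if (key_trimmed_subscribers > 0 || tracking_clients > 0) return TRIM_ACTIVE;
    /* Otherwise hand the per-slot structures to a BIO thread, like FLUSHALL. */
    return TRIM_BACKGROUND;
}
```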

Finally, trimming may also be triggered after a migration failure. In such cases, the operation ensures that any partially imported or inconsistent slot data is cleaned up, maintaining cluster consistency and preventing stale keys from remaining in the source or destination nodes.

Note about active trim: Subsequent migrations can complete while a prior trim is still running. In that case, the new migration’s trim job is queued and will start automatically after the current trim finishes. This does not affect slot ownership or client traffic—it only serializes the background cleanup.

### <a name="replica-handling"></a> Replica handling 

- During importing, new keys are propagated to the destination-side replicas. Replicas check slot ownership before replying to commands like SCAN, KEYS, and DBSIZE so that these unowned keys are not included in the reply.

  Also, when an import operation begins, the master now propagates an internal command through the replication stream, allowing replicas to recognize that an ASM operation is in progress. This is done by the internal `CLUSTER SYNCSLOTS CONF ASM-TASK` command in the replication stream. This enables replicas to trigger the relevant module events so that modules can adapt their behavior — for example, filtering out unowned keys from read-only requests during ASM operations. To support full sync with RDB delivery scenarios, a new AUX field is also added to the RDB: `cluster-asm-task`. Its value is a string in the format `task_id:source_node:dest_node:operation:state:slot_ranges`.

- After a successful migration or on a failed import, the master trims the keys. In that case, the master propagates a new command to the replica: `TRIMSLOTS RANGES <numranges> <start-slot> <end-slot> ...`. The replica starts trimming once this command is received.
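
For illustration only: the node IDs below are taken from the sample output in the New Commands section, while the operation/state strings and the slot-range encoding shown here are assumptions:

```
cluster-asm-task = "24cf41718b20f7f05901743dffc40bc9b15db339:1098d90d9ba2d1f12965442daf501ef0b6667bec:b3b5b426e7ea6166d1548b2a26e1d5adeb1213ac:import:in-progress:0-1000"
TRIMSLOTS RANGES 2 0 1000 2000 2500
```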

### <a name="propagating-data-outside-the-keyspace"></a> Propagating data outside the keyspace

When the destination node is newly added to the cluster, certain data outside the keyspace may need to be propagated first.
A common example is functions. Previously, redis-cli handled this by transferring functions when a new node was added.
With ASM, Redis now automatically dumps and sends functions to the destination node using `FUNCTION RESTORE ..REPLACE` command — done purely for convenience to simplify setup.

Additionally, modules may also need to propagate their own data outside the keyspace.
To support this, a new API has been introduced: `RM_ClusterPropagateForSlotMigration()`.
See the [Module Support](#module-support) section for implementation details.

### Limitations

1. Single migration at a time: Only one ASM migration operation is allowed at a time. This limitation simplifies the current design but can be extended in the future.

2. Large key handling: For large keys, ASM switches to AOF encoding to deliver key data in chunks. This mechanism currently applies only to non-module keys. In the future, the RESTORE command may be extended to support chunked delivery, providing a unified solution for all key types. See [Slot Snapshot Format Considerations](#slot-snapshot-format-considerations) for details.

3. There are several cases that may cause an Atomic Slot Migration (ASM) to be aborted (it can be retried afterwards):
    - FLUSHALL / FLUSHDB: These commands introduce complexity during ASM. For example, if executed on the migrating node, they must be propagated only for the migrating slots. However, when combined with active trimming, their execution may need to be deferred until it is safe to proceed, adding further complexity to the process.
    - FAILOVER: The replica cannot resume the migration process. Migration should start from the beginning.
    - Module propagates cross-slot command during ASM via RM_Replicate(): If this occurs on the migrating node, Redis cannot split the command to propagate only the relevant slots to the ASM destination. To keep the logic simple and consistent, ASM is cancelled in this case. Modules should avoid propagating cross-slot commands during migration.
    - CLIENT PAUSE: The import task cannot progress during a write pause, as doing so would violate the guarantee that no writes occur during migration. To keep things simple, the ASM task is aborted when CLIENT PAUSE is active.
    - Manual Slot Configuration Changes: If slot configuration is modified manually during ASM (for example, when legacy migration methods are mixed with ASM), the process is aborted. Note: This situation is highly unexpected — users should not combine ASM with legacy migration methods.
    
4. When active trimming is enabled, a node must not re-import the same slots while trimming for those slots is still in progress. Otherwise, it can’t distinguish newly imported keys from pre-existing ones, and the trim cron might delete the incoming keys by mistake. In this state, the node rejects IMPORT operations for those slots until trimming completes. If the master has finished trimming but a replica is still trimming, the master may still start the import operation for those slots. So, the replica checks whether the master is sending commands for those slots; if so, it blocks the master’s client connection until trimming finishes. This is a corner case, but we believe the behavior is reasonable for now. In the worst case, the master may drop the replica (e.g., buffer overrun), triggering a new full sync.

# API Changes

## <a name="new-commands"></a> New Commands 

### Public commands
1. **Syntax:**  `CLUSTER MIGRATION IMPORT <start-slot> <end-slot> [<start-slot> <end-slot>]...`
  **Args:** Slot ranges
  **Reply:** 
    - String task ID
    - -ERR <message> on failure (e.g. invalid slot range) 

    **Description:** Executes on the destination master. Accepts multiple slot ranges and triggers atomic migration for the specified ranges. Returns a task ID that can be used to monitor the status of the task. In the CLUSTER MIGRATION STATUS output, the “state” field will be `completed` on a successful operation.

2. **Syntax:**  `CLUSTER MIGRATION CANCEL [ID <id> | ALL]`
  **Args:** Task ID or ALL
  **Reply:** Number of cancelled tasks

    **Description:** Cancels an ongoing migration task by its ID or cancels all tasks if ALL is specified. Note: Cancelling a task on the source node does not stop the migration on the destination node, which will continue retrying until it is also cancelled there.


3. **Syntax:**  `CLUSTER MIGRATION STATUS [ID <id> | ALL]`
  **Args:** Task ID or ALL
    - **ID:** If provided, returns the status of the specified migration task.
    - **ALL:** Lists the status of all migration tasks.

    **Reply:**
      - A list of migration task details (both ongoing and completed ones).
      - Empty list if the given task ID does not exist.

    **Description:** Displays the status of all current and completed atomic slot migration tasks. If a specific task ID is provided, it returns detailed information for that task only.
    
    **Sample output:**
```
127.0.0.1:5001> cluster migration status all
1)  1) "id"
    2) "24cf41718b20f7f05901743dffc40bc9b15db339"
    3) "slots"
    4) "0-1000"
    5) "source"
    6) "1098d90d9ba2d1f12965442daf501ef0b6667bec"
    7) "dest"
    8) "b3b5b426e7ea6166d1548b2a26e1d5adeb1213ac"
    9) "operation"
   10) "migrate"
   11) "state"
   12) "completed"
   13) "last_error"
   14) ""
   15) "retries"
   16) "0"
   17) "create_time"
   18) "1759694528449"
   19) "start_time"
   20) "1759694528449"
   21) "end_time"
   22) "1759694528464"
   23) "write_pause_ms"
   24) "10"
```

### Internal commands

1. **Syntax:**  `CLUSTER SYNCSLOTS <arg> ...`
  **Args:** Internal messaging operations
  **Reply:**  +OK or -ERR <message> on failure (e.g. invalid slot range) 

    **Description:** Used for internal communication between source and destination nodes, e.g., handshaking, establishing multiple channels, and triggering handoff.
    
2. **Syntax:**  `TRIMSLOTS RANGES <numranges> <start-slot> <end-slot> ...`
  **Args:** Slot ranges to trim
  **Reply:**  +OK 

    **Description:** The master propagates it to replicas so that they can trim unowned keys after a successful migration or on a failed import.

## New configs

- `cluster-slot-migration-max-archived-tasks`: Redis keeps the last N migration tasks in memory so that they can be listed in `CLUSTER MIGRATION STATUS ALL` output. This config controls the maximum number of archived ASM tasks. Default value: 32, used as a hidden config
- `cluster-slot-migration-handoff-max-lag-bytes`: After the slot snapshot is completed, if the remaining replication stream size falls below this threshold, the source node pauses writes to hand off slot ownership. A higher value may trigger the handoff earlier but can lead to a longer write pause, since more data remains to be replicated. A lower value can result in a shorter write pause, but it may be harder to reach the threshold if there is a steady flow of incoming writes. Default value: 1MB
- `cluster-slot-migration-write-pause-timeout`: The maximum duration (in milliseconds) that the source node pauses writes during ASM handoff. After pausing writes, if the destination node fails to take over the slots within this timeout (for example, due to a cluster configuration update failure), the source node assumes the migration has failed and resumes writes to prevent indefinite blocking. Default value: 10 seconds
- `cluster-slot-migration-sync-buffer-drain-timeout`: Timeout in milliseconds for the sync buffer to be drained during ASM.
After the destination applies the accumulated buffer, the source continues sending commands for the migrating slots. The destination keeps applying them, but if the gap stays above the acceptable limit (see `cluster-slot-migration-handoff-max-lag-bytes`), synchronization could continue indefinitely. A timeout check is required to handle this case.
The timeout is calculated as **the maximum of two values**:
   - A configurable timeout (`cluster-slot-migration-sync-buffer-drain-timeout`) to avoid false positives.
   - A dynamic timeout based on the time the destination took to apply the slot snapshot and the buffer accumulated during slot snapshot delivery. The destination should be able to drain the remaining sync buffer in less time than this; we multiply it by 2 to be more conservative.

    Default value: 60000 milliseconds, used as a hidden config
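
As a sketch, the two non-hidden settings above can be tuned in redis.conf; the values shown are the documented defaults:

```
# Pause writes for at most 10 seconds while waiting for the destination
# to take over slot ownership during the ASM handoff.
cluster-slot-migration-write-pause-timeout 10000

# Start the handoff once the remaining replication stream for the
# migrating slots drops below this size.
cluster-slot-migration-handoff-max-lag-bytes 1mb
```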

## New flag in CLIENT LIST
- the client responsible for importing slots is marked with the `o` flag.
- the client responsible for migrating slots is marked with the `g` flag.

## New INFO fields

- `mem_cluster_slot_migration_output_buffer`: Memory usage of the migration client’s output buffer. Redis writes incoming changes to this buffer during the migration process.
- `mem_cluster_slot_migration_input_buffer`: Memory usage of the accumulated replication stream buffer on the importing node.
- `mem_cluster_slot_migration_input_buffer_peak`: Peak accumulated replication buffer size on the importing side.

## New CLUSTER INFO fields

- `cluster_slot_migration_active_tasks`: Number of in-progress ASM tasks. Currently, it will be 1 or 0. 
- `cluster_slot_migration_active_trim_running`: Number of active trim jobs in progress and scheduled
- `cluster_slot_migration_active_trim_current_job_keys`: Number of keys scheduled for deletion in the current trim job.
- `cluster_slot_migration_active_trim_current_job_trimmed`: Number of keys already deleted in the current trim job.
- `cluster_slot_migration_stats_active_trim_started`: Total number of trim jobs that have started since the process began.
- `cluster_slot_migration_stats_active_trim_completed`: Total number of trim jobs completed since the process began.
- `cluster_slot_migration_stats_active_trim_cancelled`: Total number of trim jobs cancelled since the process began.


## Changes in RDB format

A new aux field is added to RDB: `cluster-asm-task`. When an import operation begins, the master now propagates an internal command through the replication stream, allowing replicas to recognize that an ASM operation is in progress. This enables replicas to trigger the relevant module events so that modules can adapt their behavior — for example, filtering out unowned keys from read-only requests during ASM operations. To support RDB delivery scenarios as well, the new field is added to the RDB. See [replica handling](#replica-handling) for details.

## Bug fixes
- Fix a memory leak when processing a forget-node type message.
- Fix a data race where a reply was written directly to a replica client with multi-threading enabled.

We don't plan to backport these to old versions, since they are very rare cases.

## Keys visibility
When performing atomic slot migration, while keys are being imported on the destination node or trimmed on the source/destination, these keys will be filtered out in the following commands:
- KEYS
- SCAN
- RANDOMKEY
- CLUSTER GETKEYSINSLOT
- DBSIZE
- CLUSTER COUNTKEYSINSLOT

The only command that will reflect the increasing number of keys is:
- INFO KEYSPACE

## <a name="module-support"></a> Module Support 

**NOTE:** Please read the [trimming](#trimming-the-keys) section to see how ASM decides on the trimming method when modules are in use.

### New notification:
```c
#define REDISMODULE_NOTIFY_KEY_TRIMMED (1<<17) 
```
When a key is deleted by the active trim operation, this notification is sent to subscribed modules.
ASM also automatically chooses the trimming method depending on whether there are any subscribers to this new event. See [trimming](#trimming-the-keys) for further details.


### New struct in the API:
```c
typedef struct RedisModuleSlotRange {
    uint16_t start;
    uint16_t end;
} RedisModuleSlotRange;

typedef struct RedisModuleSlotRangeArray {
    int32_t num_ranges;
    RedisModuleSlotRange ranges[];
} RedisModuleSlotRangeArray;
```

### New Events
#### 1. REDISMODULE_EVENT_CLUSTER_SLOT_MIGRATION (RedisModuleEvent_ClusterSlotMigration)

These events notify modules about different stages of Atomic Slot Migration (ASM) operations, such as when an import or migration starts, fails, or completes. Modules can use these notifications to track cluster slot movements or perform custom logic during ASM transitions.

```c
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_STARTED 0
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_FAILED 1
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_COMPLETED 2
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_STARTED 3
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_FAILED 4
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_COMPLETED 5
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_MODULE_PROPAGATE 6
```

Parameter to these events:
```c
typedef struct RedisModuleClusterSlotMigrationInfo {
    uint64_t version;   /* Not used since this structure is never passed
                           from the module to the core right now. Here
                           for future compatibility. */
    char source_node_id[REDISMODULE_NODE_ID_LEN + 1];
    char destination_node_id[REDISMODULE_NODE_ID_LEN + 1];
    const char *task_id;
    RedisModuleSlotRangeArray *slots;
} RedisModuleClusterSlotMigrationInfoV1;

#define RedisModuleClusterSlotMigrationInfo RedisModuleClusterSlotMigrationInfoV1
```


#### 2. REDISMODULE_EVENT_CLUSTER_SLOT_MIGRATION_TRIM (RedisModuleEvent_ClusterSlotMigrationTrim)

These events inform modules about the lifecycle of ASM key trimming operations. Modules can use them to detect when trimming starts, completes, or is performed asynchronously in the background.

```c
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_STARTED     0
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_COMPLETED   1
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_BACKGROUND  2
```

Parameter to these events:
```c
typedef struct RedisModuleClusterSlotMigrationTrimInfo {
    uint64_t version;   /* Not used since this structure is never passed
                           from the module to the core right now. Here
                           for future compatibility. */
    RedisModuleSlotRangeArray *slots;
} RedisModuleClusterSlotMigrationTrimInfoV1;

#define RedisModuleClusterSlotMigrationTrimInfo RedisModuleClusterSlotMigrationTrimInfoV1
```

### New functions

```c
/* Returns 1 if keys in the specified slot can be accessed by this node,
 * 0 otherwise.
 *
 * This function returns 1 in the following cases:
 * - The slot is owned by this node or by its master if this node is a replica
 * - The slot is being imported under the old slot migration approach
 *   (CLUSTER SETSLOT <slot> IMPORTING ..)
 * - Not in cluster mode (all slots are accessible)
 *
 * Returns 0 for:
 * - Invalid slot numbers (< 0 or >= 16384)
 * - Slots owned by other nodes
 */
int RM_ClusterCanAccessKeysInSlot(int slot);

/* Propagate commands along with slot migration.
 *
 * This function allows modules to add commands that will be sent to the
 * destination node before the actual slot migration begins. It should only be
 * called during the REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_MODULE_PROPAGATE
 * event.
 *
 * This function can be called multiple times within the same event to
 * replicate multiple commands. All commands will be sent before the
 * actual slot data migration begins.
 *
 * Note: This function is only available in the fork child process just before
 *       slot snapshot delivery begins.
 *
 * On success REDISMODULE_OK is returned, otherwise
 * REDISMODULE_ERR is returned and errno is set to the following values:
 *
 * * EINVAL: function arguments or format specifiers are invalid.
 * * EBADF: not called in the correct context, e.g. not called in the
 *   REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_MODULE_PROPAGATE event.
 * * ENOENT: command does not exist.
 * * ENOTSUP: command is cross-slot.
 * * ERANGE: command contains keys that are not within the migrating slot range.
 */
int RM_ClusterPropagateForSlotMigration(RedisModuleCtx *ctx,
                                        const char *cmdname,
                                        const char *fmt, ...);

/* Returns the locally owned slot ranges for the node.
 *
 * An optional `ctx` can be provided to enable auto-memory management.
 * If cluster mode is disabled, the array will include all slots (0–16383).
 * If the node is a replica, the slot ranges of its master are returned.
 *
 * The returned array must be freed with RM_ClusterFreeSlotRanges().
 */
RedisModuleSlotRangeArray *RM_ClusterGetLocalSlotRanges(RedisModuleCtx *ctx);

/* Frees a slot range array returned by RM_ClusterGetLocalSlotRanges().
 * Pass the `ctx` pointer only if the array was created with a context. */
void RM_ClusterFreeSlotRanges(RedisModuleCtx *ctx, RedisModuleSlotRangeArray *slots);
```
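
A minimal module sketch tying these pieces together, assuming the RM_* functions above are exported to modules under the usual RedisModule_* prefix; the module name, command name, and format specifiers are made up:

```c
#include "redismodule.h"

/* Hypothetical handler for ASM server events. */
static void onSlotMigration(RedisModuleCtx *ctx, RedisModuleEvent e,
                            uint64_t sub, void *data) {
    REDISMODULE_NOT_USED(e);
    RedisModuleClusterSlotMigrationInfo *info = data;

    if (sub == REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_MODULE_PROPAGATE) {
        /* Runs in the fork child just before snapshot delivery: ship module
         * metadata ahead of the slot data. The command name and the RM_Call-style
         * "cc" format (two C strings) are assumptions for this sketch. */
        RedisModule_ClusterPropagateForSlotMigration(ctx, "mymod.setmeta",
                                                     "cc", "schema-version", "1");
    } else if (sub == REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_STARTED) {
        RedisModule_Log(ctx, "notice", "ASM import started, task %s", info->task_id);
    }
}

int RedisModule_OnLoad(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
    REDISMODULE_NOT_USED(argv);
    REDISMODULE_NOT_USED(argc);
    if (RedisModule_Init(ctx, "asmdemo", 1, REDISMODULE_APIVER_1) == REDISMODULE_ERR)
        return REDISMODULE_ERR;
    /* Subscribe to the new ASM event introduced in this PR. */
    if (RedisModule_SubscribeToServerEvent(ctx, RedisModuleEvent_ClusterSlotMigration,
                                           onSlotMigration) == REDISMODULE_ERR)
        return REDISMODULE_ERR;
    return REDISMODULE_OK;
}
```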

## ASM API for alternative cluster implementations

Following https://github.com/redis/redis/pull/12742, Redis cluster code was restructured to support alternative cluster implementations. Redis uses cluster_legacy.c implementation by default. This PR adds a generic ASM API so alternative implementations can initiate and coordinate Atomic Slot Migration (ASM) while Redis executes the data movement and emits state changes.

The documentation lives in `cluster.h`. There are two new functions:

```c
/* Called by the cluster implementation to request an ASM operation.
 * (cluster impl --> redis) */
int clusterAsmProcess(const char *task_id, int event, void *arg, char **err);

/* Called when an ASM event occurs to notify the cluster implementation.
 * (redis --> cluster impl) */
int clusterAsmOnEvent(const char *task_id, int event, void *arg);
```

```c
/* API for alternative cluster implementations to start and coordinate
 * Atomic Slot Migration (ASM).
 *
 * These two functions drive ASM for alternative cluster implementations.
 * - clusterAsmProcess(...) impl -> redis: initiates/advances/cancels ASM operations
 * - clusterAsmOnEvent(...) redis -> impl: notifies state changes
 *
 * Generic steps for an alternative implementation:
 * - On the destination side, the implementation calls clusterAsmProcess(ASM_EVENT_IMPORT_START)
 *   to start an import operation.
 * - Redis calls clusterAsmOnEvent() when an ASM event occurs.
 * - On the source side, Redis will call clusterAsmOnEvent(ASM_EVENT_HANDOFF_PREP)
 *   when slots are ready to be handed off and the write pause is needed.
 * - The implementation stops the traffic to the slots and calls clusterAsmProcess(ASM_EVENT_HANDOFF).
 * - On the destination side, Redis calls clusterAsmOnEvent(ASM_EVENT_TAKEOVER)
 *   when the destination node is ready to take over the slots, waiting for the ownership change.
 * - The cluster implementation updates the config and calls clusterAsmProcess(ASM_EVENT_DONE)
 *   to notify Redis that the slot ownership has changed.
 *
 * Sequence diagram for import:
 * - Note: shows only the events that the cluster implementation needs to react to.
 *
 * ┌───────────────┐          ┌───────────────┐          ┌───────────────┐          ┌───────────────┐
 * │  Destination  │          │  Destination  │          │    Source     │          │    Source     │
 * │  Cluster impl │          │    Master     │          │    Master     │          │  Cluster impl │
 * └───────┬───────┘          └───────┬───────┘          └───────┬───────┘          └───────┬───────┘
 *         │                          │                          │                          │
 *         │ ASM_EVENT_IMPORT_START   │                          │                          │
 *         ├─────────────────────────►│                          │                          │
 *         │                          │ CLUSTER SYNCSLOTS <arg>  │                          │
 *         │                          ├─────────────────────────►│                          │
 *         │                          │                          │                          │
 *         │                          │ SNAPSHOT(restore cmds)   │                          │
 *         │                          │◄─────────────────────────┤                          │
 *         │                          │ Repl stream              │                          │
 *         │                          │◄─────────────────────────┤                          │
 *         │                          │                          │ ASM_EVENT_HANDOFF_PREP   │
 *         │                          │                          ├─────────────────────────►│
 *         │                          │                          │ ASM_EVENT_HANDOFF        │
 *         │                          │                          │◄─────────────────────────┤
 *         │                          │ Drain repl stream        │                          │
 *         │                          │◄─────────────────────────┤                          │
 *         │ ASM_EVENT_TAKEOVER       │                          │                          │
 *         │◄─────────────────────────┤                          │                          │
 *         │                          │                          │                          │
 *         │ ASM_EVENT_DONE           │                          │                          │
 *         ├─────────────────────────►│                          │ ASM_EVENT_DONE           │
 *         │                          │                          │◄─────────────────────────┤
 *         │                          │                          │                          │
 */

#define ASM_EVENT_IMPORT_START      1  /* Start a new import operation (destination side) */
#define ASM_EVENT_CANCEL            2  /* Cancel an ongoing import/migrate operation (source and destination side) */
#define ASM_EVENT_HANDOFF_PREP      3  /* Slot is ready to be handed off to the destination shard (source side) */
#define ASM_EVENT_HANDOFF           4  /* Notify that the slot can be handed off (source side) */
#define ASM_EVENT_TAKEOVER          5  /* Ready to take over the slot, waiting for config change (destination side) */
#define ASM_EVENT_DONE              6  /* Notify that import/migrate is completed, config is updated (source and destination side) */

#define ASM_EVENT_IMPORT_PREP       7  /* Import is about to start, the implementation may reject by returning C_ERR */
#define ASM_EVENT_IMPORT_STARTED    8  /* Import started */
#define ASM_EVENT_IMPORT_FAILED     9  /* Import failed */
#define ASM_EVENT_IMPORT_COMPLETED  10 /* Import completed (config updated) */
#define ASM_EVENT_MIGRATE_PREP      11 /* Migrate is about to start, the implementation may reject by returning C_ERR */
#define ASM_EVENT_MIGRATE_STARTED   12 /* Migrate started */
#define ASM_EVENT_MIGRATE_FAILED    13 /* Migrate failed */
#define ASM_EVENT_MIGRATE_COMPLETED 14 /* Migrate completed (config updated) */
```
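
For an alternative cluster implementation, the glue might look roughly like the following sketch. The myimpl_* helpers are hypothetical and error handling is omitted:

```c
/* Hypothetical glue for an alternative cluster implementation: Redis calls
 * clusterAsmOnEvent(), and the implementation drives the task forward through
 * clusterAsmProcess(). The myimpl_* helpers are made up for this sketch. */
int clusterAsmOnEvent(const char *task_id, int event, void *arg) {
    UNUSED(arg);
    char *err = NULL;
    switch (event) {
    case ASM_EVENT_HANDOFF_PREP:
        /* Source side: stop traffic to the migrating slots, then allow handoff. */
        myimpl_pause_traffic_for_task(task_id);
        return clusterAsmProcess(task_id, ASM_EVENT_HANDOFF, NULL, &err);
    case ASM_EVENT_TAKEOVER:
        /* Destination side: publish the new slot ownership, then confirm. */
        myimpl_publish_new_slot_config(task_id);
        return clusterAsmProcess(task_id, ASM_EVENT_DONE, NULL, &err);
    default:
        return C_OK; /* Other events are informational for this sketch. */
    }
}
```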

---------

Co-authored-by: Yuan Wang <yuan.wang@redis.com>
@ -1827,6 +1827,41 @@ aof-timestamp-enabled no
#
# cluster-slot-stats-enabled no
# Slot migration write pause timeout controls how long the source node will
# pause write operations during slot migration handoff phase. This usually
# finishes in a few milliseconds, depending on traffic and load. When the source
# node pauses writes to allow the destination to catch up and take the ownership
# of the slots, this timeout prevents writes from being blocked indefinitely.
#
# If the destination node fails to complete the slot ownership takeover within
# this timeout, the source node will resume accepting writes and assume the
# migration task is failed. This prevents the source node from being permanently
# blocked if the destination node becomes unresponsive or fails during migration.
#
# If this timeout is set too low, the source may resume writes and assume that
# the slot migration has failed while the destination is still in the process of
# draining the replication stream and publishing the configuration update.
# During this window, writes accepted by the source will not be replicated to
# the destination; if the destination later publishes the updated config and
# takes ownership, those writes could be lost. Therefore, avoid setting this
# timeout too low.
#
# This timeout is specified in milliseconds.
#
# cluster-slot-migration-write-pause-timeout 10000
# This config controls the maximum acceptable lag in bytes between source and
# destination nodes during slot migration before triggering the slot handoff
# phase. If the remaining replication stream size falls below this threshold,
# the source node pauses writes and then signals destination that it can take
# over the slot ownership after draining the remaining replication stream.
#
# A smaller value means potentially shorter write pause duration, but it may
# take longer for the destination to catch up. A larger value means handoff can
# be triggered earlier, but the write pause may potentially be longer.
#
# cluster-slot-migration-handoff-max-lag-bytes 1mb
# In order to setup your cluster make sure to read the documentation
# available at https://redis.io web site.

View file

@ -382,7 +382,7 @@ endif
REDIS_SERVER_NAME=redis-server$(PROG_SUFFIX)
REDIS_SENTINEL_NAME=redis-sentinel$(PROG_SUFFIX)
REDIS_SERVER_OBJ=threads_mngr.o memory_prefetch.o adlist.o quicklist.o ae.o anet.o dict.o ebuckets.o eventnotifier.o iothread.o mstr.o kvstore.o fwtree.o estore.o server.o sds.o zmalloc.o lzf_c.o lzf_d.o pqsort.o zipmap.o sha1.o ziplist.o release.o networking.o util.o object.o db.o replication.o rdb.o t_string.o t_list.o t_set.o t_zset.o t_hash.o config.o aof.o pubsub.o multi.o debug.o sort.o intset.o syncio.o cluster.o cluster_legacy.o cluster_slot_stats.o crc16.o endianconv.o slowlog.o eval.o bio.o rio.o rand.o memtest.o syscheck.o crcspeed.o crccombine.o crc64.o bitops.o sentinel.o notify.o setproctitle.o blocked.o hyperloglog.o latency.o sparkline.o redis-check-rdb.o redis-check-aof.o geo.o lazyfree.o module.o evict.o expire.o geohash.o geohash_helper.o childinfo.o defrag.o siphash.o rax.o t_stream.o listpack.o localtime.o lolwut.o lolwut5.o lolwut6.o lolwut8.o acl.o tracking.o socket.o tls.o sha256.o timeout.o setcpuaffinity.o monotonic.o mt19937-64.o resp_parser.o call_reply.o script_lua.o script.o functions.o function_lua.o commands.o strl.o connection.o unix.o logreqres.o
REDIS_SERVER_OBJ=threads_mngr.o memory_prefetch.o adlist.o quicklist.o ae.o anet.o dict.o ebuckets.o eventnotifier.o iothread.o mstr.o kvstore.o fwtree.o estore.o server.o sds.o zmalloc.o lzf_c.o lzf_d.o pqsort.o zipmap.o sha1.o ziplist.o release.o networking.o util.o object.o db.o replication.o rdb.o t_string.o t_list.o t_set.o t_zset.o t_hash.o config.o aof.o pubsub.o multi.o debug.o sort.o intset.o syncio.o cluster.o cluster_asm.o cluster_legacy.o cluster_slot_stats.o crc16.o endianconv.o slowlog.o eval.o bio.o rio.o rand.o memtest.o syscheck.o crcspeed.o crccombine.o crc64.o bitops.o sentinel.o notify.o setproctitle.o blocked.o hyperloglog.o latency.o sparkline.o redis-check-rdb.o redis-check-aof.o geo.o lazyfree.o module.o evict.o expire.o geohash.o geohash_helper.o childinfo.o defrag.o siphash.o rax.o t_stream.o listpack.o localtime.o lolwut.o lolwut5.o lolwut6.o lolwut8.o acl.o tracking.o socket.o tls.o sha256.o timeout.o setcpuaffinity.o monotonic.o mt19937-64.o resp_parser.o call_reply.o script_lua.o script.o functions.o function_lua.o commands.o strl.o connection.o unix.o logreqres.o
REDIS_CLI_NAME=redis-cli$(PROG_SUFFIX)
REDIS_CLI_OBJ=anet.o adlist.o dict.o redis-cli.o zmalloc.o release.o ae.o redisassert.o crcspeed.o crccombine.o crc64.o siphash.o crc16.o monotonic.o cli_common.o mt19937-64.o strl.o cli_commands.o
REDIS_BENCHMARK_NAME=redis-benchmark$(PROG_SUFFIX)

View file

@ -3203,7 +3203,7 @@ void addReplyCommandCategories(client *c, struct redisCommand *cmd) {
/* When successful, initiates an internal connection, that is able to execute
* internal commands (see CMD_INTERNAL). */
static void internalAuth(client *c) {
if (server.cluster == NULL) {
if (!server.cluster_enabled) {
addReplyError(c, "Cannot authenticate as an internal connection on non-cluster instances");
return;
}

View file

@ -11,6 +11,7 @@
#include "bio.h"
#include "rio.h"
#include "functions.h"
#include "cluster_asm.h"
#include <signal.h>
#include <fcntl.h>
@ -2384,11 +2385,48 @@ werr:
return 0;
}
int rewriteObject(rio *r, robj *key, robj *o, int dbid, long long expiretime) {
/* Save the key and associated value */
if (o->type == OBJ_STRING) {
/* Emit a SET command */
static const char cmd[]="*3\r\n$3\r\nSET\r\n";
if (rioWrite(r,cmd,sizeof(cmd)-1) == 0) return C_ERR;
/* Key and value */
if (rioWriteBulkObject(r,key) == 0) return C_ERR;
if (rioWriteBulkObject(r,o) == 0) return C_ERR;
} else if (o->type == OBJ_LIST) {
if (rewriteListObject(r,key,o) == 0) return C_ERR;
} else if (o->type == OBJ_SET) {
if (rewriteSetObject(r,key,o) == 0) return C_ERR;
} else if (o->type == OBJ_ZSET) {
if (rewriteSortedSetObject(r,key,o) == 0) return C_ERR;
} else if (o->type == OBJ_HASH) {
if (rewriteHashObject(r,key,o) == 0) return C_ERR;
} else if (o->type == OBJ_STREAM) {
if (rewriteStreamObject(r,key,o) == 0) return C_ERR;
} else if (o->type == OBJ_MODULE) {
if (rewriteModuleObject(r,key,o,dbid) == 0) return C_ERR;
} else {
serverPanic("Unknown object type");
}
/* Save the expire time */
if (expiretime != -1) {
static const char cmd[]="*3\r\n$9\r\nPEXPIREAT\r\n";
if (rioWrite(r,cmd,sizeof(cmd)-1) == 0) return C_ERR;
if (rioWriteBulkObject(r,key) == 0) return C_ERR;
if (rioWriteBulkLongLong(r,expiretime) == 0) return C_ERR;
}
return C_OK;
}
int rewriteAppendOnlyFileRio(rio *aof) {
dictEntry *de;
int j;
long key_count = 0;
long long updated_time = 0;
unsigned long long skipped = 0;
kvstoreIterator *kvs_it = NULL;
/* Record timestamp at the beginning of rewriting AOF. */
@ -2420,34 +2458,21 @@ int rewriteAppendOnlyFileRio(rio *aof) {
/* Get the expire time */
expiretime = kvobjGetExpire(o);
/* Skip keys that are being trimmed */
if (server.cluster_enabled) {
int curr_slot = kvstoreIteratorGetCurrentDictIndex(kvs_it);
if (isSlotInTrimJob(curr_slot)) {
skipped++;
continue;
}
}
/* Set on stack string object for key */
robj key;
initStaticStringObject(key, kvobjGetKey(o));
/* Save the key and associated value */
if (o->type == OBJ_STRING) {
/* Emit a SET command */
char cmd[]="*3\r\n$3\r\nSET\r\n";
if (rioWrite(aof,cmd,sizeof(cmd)-1) == 0) goto werr;
/* Key and value */
if (rioWriteBulkObject(aof,&key) == 0) goto werr;
if (rioWriteBulkObject(aof,o) == 0) goto werr;
} else if (o->type == OBJ_LIST) {
if (rewriteListObject(aof,&key,o) == 0) goto werr;
} else if (o->type == OBJ_SET) {
if (rewriteSetObject(aof,&key,o) == 0) goto werr;
} else if (o->type == OBJ_ZSET) {
if (rewriteSortedSetObject(aof,&key,o) == 0) goto werr;
} else if (o->type == OBJ_HASH) {
if (rewriteHashObject(aof,&key,o) == 0) goto werr;
} else if (o->type == OBJ_STREAM) {
if (rewriteStreamObject(aof,&key,o) == 0) goto werr;
} else if (o->type == OBJ_MODULE) {
if (rewriteModuleObject(aof,&key,o,j) == 0) goto werr;
} else {
serverPanic("Unknown object type");
}
if (rewriteObject(aof, &key, o, j, expiretime) == C_ERR) goto werr;
/* In fork child process, we can try to release memory back to the
* OS and possibly avoid or decrease COW. We give the dismiss
@ -2455,14 +2480,6 @@ int rewriteAppendOnlyFileRio(rio *aof) {
size_t dump_size = aof->processed_bytes - aof_bytes_before_key;
if (server.in_fork_child) dismissObject(o, dump_size);
/* Save the expire time */
if (expiretime != -1) {
char cmd[]="*3\r\n$9\r\nPEXPIREAT\r\n";
if (rioWrite(aof,cmd,sizeof(cmd)-1) == 0) goto werr;
if (rioWriteBulkObject(aof,&key) == 0) goto werr;
if (rioWriteBulkLongLong(aof,expiretime) == 0) goto werr;
}
/* Update info every 1 second (approximately).
* in order to avoid calling mstime() on each iteration, we will
* check the diff every 1024 keys */
@ -2480,6 +2497,7 @@ int rewriteAppendOnlyFileRio(rio *aof) {
}
kvstoreIteratorRelease(kvs_it);
}
serverLog(LL_NOTICE, "AOF rewrite done, %ld keys saved, %llu keys skipped.", key_count, skipped);
return C_OK;
werr:

View file

@ -76,7 +76,8 @@ void blockClient(client *c, int btype) {
serverAssert(!(c->flags & CLIENT_MASTER &&
btype != BLOCKED_MODULE &&
btype != BLOCKED_LAZYFREE &&
btype != BLOCKED_POSTPONE));
btype != BLOCKED_POSTPONE &&
btype != BLOCKED_POSTPONE_TRIM));
c->flags |= CLIENT_BLOCKED;
c->bstate.btype = btype;
@ -191,7 +192,7 @@ void unblockClient(client *c, int queue_for_reprocessing) {
} else if (c->bstate.btype == BLOCKED_MODULE) {
if (moduleClientIsBlockedOnKeys(c)) unblockClientWaitingData(c);
unblockClientFromModule(c);
} else if (c->bstate.btype == BLOCKED_POSTPONE) {
} else if (c->bstate.btype == BLOCKED_POSTPONE || c->bstate.btype == BLOCKED_POSTPONE_TRIM) {
listDelNode(server.postponed_clients,c->postponed_list_node);
c->postponed_list_node = NULL;
} else if (c->bstate.btype == BLOCKED_SHUTDOWN) {
@ -293,7 +294,7 @@ void disconnectAllBlockedClients(void) {
* command processing will start from scratch, and the command will
* be either executed or rejected. (unlike LIST blocked clients for
* which the command is already in progress in a way. */
if (c->bstate.btype == BLOCKED_POSTPONE)
if (c->bstate.btype == BLOCKED_POSTPONE || c->bstate.btype == BLOCKED_POSTPONE_TRIM)
continue;
if (c->bstate.btype == BLOCKED_LAZYFREE) {
@ -639,15 +640,21 @@ void blockForAofFsync(client *c, mstime_t timeout, long long offset, int numloca
/* Postpone client from executing a command. For example the server might be busy
* requesting to avoid processing clients commands which will be processed later
* when the it is ready to accept them. */
void blockPostponeClient(client *c) {
void blockPostponeClientWithType(client *c, int btype) {
serverAssert(btype == BLOCKED_POSTPONE || btype == BLOCKED_POSTPONE_TRIM);
c->bstate.timeout = 0;
blockClient(c,BLOCKED_POSTPONE);
blockClient(c, btype);
listAddNodeTail(server.postponed_clients, c);
c->postponed_list_node = listLast(server.postponed_clients);
/* Mark this client to execute its command */
c->flags |= CLIENT_PENDING_COMMAND;
}
/* Postpone client from executing a command. */
void blockPostponeClient(client *c) {
blockPostponeClientWithType(c, BLOCKED_POSTPONE);
}
/* Block client due to shutdown command */
void blockClientShutdown(client *c) {
blockClient(c, BLOCKED_SHUTDOWN);


@ -20,6 +20,7 @@
#include "server.h"
#include "cluster.h"
#include "cluster_asm.h"
#include "cluster_slot_stats.h"
#include <ctype.h>
@ -279,7 +280,7 @@ void restoreCommand(client *c) {
objectSetLRUOrLFU(kv, lfu_freq, lru_idle, lru_clock, 1000);
signalModifiedKey(c,c->db,key);
notifyKeyspaceEvent(NOTIFY_GENERIC,"restore",key,c->db->id);
/* If we deleted a key that means REPLACE parameter was passed and the
* destination key existed. */
if (deleted) {
@ -1016,6 +1017,11 @@ void clusterCommand(client *c) {
addReplyError(c,"Invalid slot");
return;
}
if (!clusterCanAccessKeysInSlot(slot)) {
addReplyLongLong(c, 0);
return;
}
addReplyLongLong(c,countKeysInSlot(slot));
} else if (!strcasecmp(c->argv[1]->ptr,"getkeysinslot") && c->argc == 4) {
/* CLUSTER GETKEYSINSLOT <slot> <count> */
@ -1031,6 +1037,11 @@ void clusterCommand(client *c) {
return;
}
if (!clusterCanAccessKeysInSlot(slot)) {
addReplyArrayLen(c, 0);
return;
}
unsigned int keys_in_slot = countKeysInSlot(slot);
unsigned int numkeys = maxkeys > keys_in_slot ? keys_in_slot : maxkeys;
addReplyArrayLen(c,numkeys);
@ -1588,14 +1599,374 @@ void readonlyCommand(client *c) {
addReply(c,shared.ok);
}
void replySlotsFlushAndFree(client *c, SlotsFlush *sflush) {
addReplyArrayLen(c, sflush->numRanges);
for (int i = 0 ; i < sflush->numRanges ; i++) {
addReplyArrayLen(c, 2);
addReplyLongLong(c, sflush->ranges[i].first);
addReplyLongLong(c, sflush->ranges[i].last);
/* Remove all the keys in the specified hash slot.
* The number of removed items is returned. */
unsigned int clusterDelKeysInSlot(unsigned int hashslot, int by_command) {
unsigned int j = 0;
if (!kvstoreDictSize(server.db->keys, (int) hashslot))
return 0;
kvstoreDictIterator *kvs_di = NULL;
dictEntry *de = NULL;
kvs_di = kvstoreGetDictSafeIterator(server.db->keys, (int) hashslot);
while((de = kvstoreDictIteratorNext(kvs_di)) != NULL) {
enterExecutionUnit(1, 0);
sds sdskey = kvobjGetKey(dictGetKV(de));
robj *key = createStringObject(sdskey, sdslen(sdskey));
dbDelete(&server.db[0], key);
signalModifiedKey(NULL, &server.db[0], key);
if (by_command) {
/* Keys are deleted by a command (trimslots), so we need to fire the
* keyspace notification. However, we don't need to propagate a DEL
* command, as the command itself (trimslots) will be propagated. */
notifyKeyspaceEvent(NOTIFY_GENERIC, "del", key, server.db[0].id);
} else {
/* Propagate the DEL command */
propagateDeletion(&server.db[0], key, server.lazyfree_lazy_server_del);
/* The keys are not logically deleted from the database, they are just
* moved to another node. Modules need to know that these keys are no
* longer available locally, so send the keyspace notification only to
* modules, not to clients. */
moduleNotifyKeyspaceEvent(NOTIFY_GENERIC, "del", key, server.db[0].id);
}
exitExecutionUnit();
postExecutionUnitOperations();
decrRefCount(key);
j++;
server.dirty++;
}
zfree(sflush);
kvstoreReleaseDictIterator(kvs_di);
return j;
}
/* Delete the keys in the slot ranges. Returns the number of deleted items */
unsigned int clusterDelKeysInSlotRangeArray(slotRangeArray *slots, int by_command) {
unsigned int j = 0;
for (int i = 0; i < slots->num_ranges; i++) {
for (int slot = slots->ranges[i].start; slot <= slots->ranges[i].end; slot++) {
j += clusterDelKeysInSlot(slot, by_command);
}
}
return j;
}
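/* Return 1 if this node is the owner of the given slot, 0 otherwise. */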
int clusterIsMySlot(int slot) {
return getMyClusterNode() == getNodeBySlot(slot);
}
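/* Reply with the [start, end] pairs of the flushed slot ranges and free the array. */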
void replySlotsFlushAndFree(client *c, slotRangeArray *slots) {
addReplyArrayLen(c, slots->num_ranges);
for (int i = 0 ; i < slots->num_ranges ; i++) {
addReplyArrayLen(c, 2);
addReplyLongLong(c, slots->ranges[i].start);
addReplyLongLong(c, slots->ranges[i].end);
}
slotRangeArrayFree(slots);
}
/* Checks that slot ranges are well-formed and non-overlapping. */
int validateSlotRanges(slotRangeArray *slots, sds *err) {
unsigned char used_slots[CLUSTER_SLOTS] = {0};
if (slots->num_ranges <= 0 || slots->num_ranges >= CLUSTER_SLOTS) {
*err = sdscatprintf(sdsempty(), "invalid number of slot ranges: %d", slots->num_ranges);
return C_ERR;
}
for (int i = 0; i < slots->num_ranges; i++) {
if (slots->ranges[i].start >= CLUSTER_SLOTS ||
slots->ranges[i].end >= CLUSTER_SLOTS)
{
*err = sdscatprintf(sdsempty(), "slot range is out of range: %d-%d",
slots->ranges[i].start, slots->ranges[i].end);
return C_ERR;
}
if (slots->ranges[i].start > slots->ranges[i].end) {
*err = sdscatprintf(sdsempty(), "start slot number %d is greater than end slot number %d",
slots->ranges[i].start, slots->ranges[i].end);
return C_ERR;
}
for (int j = slots->ranges[i].start; j <= slots->ranges[i].end; j++) {
if (used_slots[j]) {
*err = sdscatprintf(sdsempty(), "Slot %d specified multiple times", j);
return C_ERR;
}
used_slots[j]++;
}
}
return C_OK;
}
/* Create a slot range array with the specified number of ranges. */
slotRangeArray *slotRangeArrayCreate(int num_ranges) {
slotRangeArray *slots = zcalloc(sizeof(slotRangeArray) + num_ranges * sizeof(slotRange));
slots->num_ranges = num_ranges;
return slots;
}
/* Duplicate the slot range array. */
slotRangeArray *slotRangeArrayDup(slotRangeArray *slots) {
slotRangeArray *dup = slotRangeArrayCreate(slots->num_ranges);
memcpy(dup->ranges, slots->ranges, sizeof(slotRange) * slots->num_ranges);
return dup;
}
/* Set the slot range at the specified index. */
void slotRangeArraySet(slotRangeArray *slots, int idx, int start, int end) {
slots->ranges[idx].start = start;
slots->ranges[idx].end = end;
}
/* Create a slot range string in the format of: "1000-2000 3000-4000 ..." */
sds slotRangeArrayToString(slotRangeArray *slots) {
sds s = sdsempty();
for (int i = 0; i < slots->num_ranges; i++) {
slotRange *sr = &slots->ranges[i];
s = sdscatprintf(s, "%d-%d ", sr->start, sr->end);
}
sdssetlen(s, sdslen(s) - 1);
s[sdslen(s)] = '\0';
return s;
}
/* Parse a slot range string in the format "1000-2000 3000-4000 ..." into a slotRangeArray.
* Returns a new slotRangeArray on success, NULL on failure. */
slotRangeArray *slotRangeArrayFromString(sds data) {
int num_ranges;
long long start, end;
slotRangeArray *slots = NULL;
if (!data || sdslen(data) == 0) return NULL;
sds *parts = sdssplitlen(data, sdslen(data), " ", 1, &num_ranges);
if (num_ranges <= 0) goto err;
slots = slotRangeArrayCreate(num_ranges);
/* Parse each slot range */
for (int i = 0; i < num_ranges; i++) {
char *dash = strchr(parts[i], '-');
if (!dash) goto err;
if (string2ll(parts[i], dash - parts[i], &start) == 0 ||
string2ll(dash + 1, sdslen(parts[i]) - (dash - parts[i]) - 1, &end) == 0)
goto err;
slotRangeArraySet(slots, i, start, end);
}
/* Validate all ranges */
sds err_msg = NULL;
if (validateSlotRanges(slots, &err_msg) != C_OK) {
if (err_msg) sdsfree(err_msg);
goto err;
}
sdsfreesplitres(parts, num_ranges);
return slots;
err:
if (slots) slotRangeArrayFree(slots);
sdsfreesplitres(parts, num_ranges);
return NULL;
}
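/* qsort() comparator: order slot ranges by their start slot. */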
static int compareSlotRange(const void *a, const void *b) {
const slotRange *sa = a;
const slotRange *sb = b;
if (sa->start < sb->start) return -1;
if (sa->start > sb->start) return 1;
return 0;
}
/* Compare two slot range arrays, return 1 if equal, 0 otherwise */
int slotRangeArrayIsEqual(slotRangeArray *slots1, slotRangeArray *slots2) {
if (slots1->num_ranges != slots2->num_ranges) return 0;
/* Sort slot ranges first */
qsort(slots1->ranges, slots1->num_ranges, sizeof(slotRange), compareSlotRange);
qsort(slots2->ranges, slots2->num_ranges, sizeof(slotRange), compareSlotRange);
for (int i = 0; i < slots1->num_ranges; i++) {
if (slots1->ranges[i].start != slots2->ranges[i].start ||
slots1->ranges[i].end != slots2->ranges[i].end) {
return 0;
}
}
return 1;
}
/* Add a slot to the slot range array.
* Usage:
* slotRangeArray *slots = NULL;
* slots = slotRangeArrayAppend(slots, 1000);
* slots = slotRangeArrayAppend(slots, 1001);
* slots = slotRangeArrayAppend(slots, 1003);
* slots = slotRangeArrayAppend(slots, 1004);
* slots = slotRangeArrayAppend(slots, 1005);
*
* Result: 1000-1001, 1003-1005
* Note: `slot` must be greater than the previous slot.
* */
slotRangeArray *slotRangeArrayAppend(slotRangeArray *slots, int slot) {
if (slots == NULL) {
slots = slotRangeArrayCreate(4);
slots->ranges[0].start = slot;
slots->ranges[0].end = slot;
slots->num_ranges = 1;
return slots;
}
serverAssert(slots->num_ranges >= 0 && slots->num_ranges <= CLUSTER_SLOTS);
serverAssert(slot > slots->ranges[slots->num_ranges - 1].end);
/* Check if we can extend the last range */
slotRange *last = &slots->ranges[slots->num_ranges - 1];
if (slot == last->end + 1) {
last->end = slot;
return slots;
}
/* Calculate current capacity and reallocate if needed */
int cap = (int) ((zmalloc_size(slots) - sizeof(slotRangeArray)) / sizeof(slotRange));
if (slots->num_ranges >= cap)
slots = zrealloc(slots, sizeof(slotRangeArray) + sizeof(slotRange) * cap * 2);
/* Add new single-slot range */
slots->ranges[slots->num_ranges].start = slot;
slots->ranges[slots->num_ranges].end = slot;
slots->num_ranges++;
return slots;
}
/* Returns 1 if the slot range array contains the given slot, 0 otherwise. */
int slotRangeArrayContains(slotRangeArray *slots, unsigned int slot) {
for (int i = 0; i < slots->num_ranges; i++)
if (slots->ranges[i].start <= slot && slots->ranges[i].end >= slot)
return 1;
return 0;
}
/* Free the slot range array. */
void slotRangeArrayFree(slotRangeArray *slots) {
zfree(slots);
}
/* Generic version of slotRangeArrayFree(). */
void slotRangeArrayFreeGeneric(void *slots) {
slotRangeArrayFree(slots);
}
/* Slot range array iterator */
slotRangeArrayIter *slotRangeArrayGetIterator(slotRangeArray *slots) {
slotRangeArrayIter *it = zmalloc(sizeof(*it));
it->slots = slots;
it->range_index = 0;
it->cur_slot = slots->num_ranges > 0 ? slots->ranges[0].start : -1;
return it;
}
/* Returns the next slot in the array, or -1 if there are no more slots. */
int slotRangeArrayNext(slotRangeArrayIter *it) {
if (it->range_index >= it->slots->num_ranges) return -1;
if (it->cur_slot < it->slots->ranges[it->range_index].end) {
it->cur_slot++;
} else {
it->range_index++;
if (it->range_index < it->slots->num_ranges)
it->cur_slot = it->slots->ranges[it->range_index].start;
else
it->cur_slot = -1; /* finished */
}
return it->cur_slot;
}
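/* Return the slot the iterator currently points to, without advancing it. */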
int slotRangeArrayGetCurrentSlot(slotRangeArrayIter *it) {
return it->cur_slot;
}
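/* Free the iterator. The underlying slot range array is not freed. */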
void slotRangeArrayIteratorFree(slotRangeArrayIter *it) {
zfree(it);
}
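/* Example (illustrative only, not used elsewhere in this PR): iterate every
* slot contained in a slotRangeArray with the iterator API above.
*
*   slotRangeArrayIter *it = slotRangeArrayGetIterator(slots);
*   for (int slot = slotRangeArrayGetCurrentSlot(it); slot != -1;
*        slot = slotRangeArrayNext(it)) {
*       ... process 'slot' ...
*   }
*   slotRangeArrayIteratorFree(it);
*/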
/* Parse slot ranges from the command arguments. Returns NULL on error. */
slotRangeArray *parseSlotRangesOrReply(client *c, int argc, int pos) {
int start, end, count;
slotRangeArray *slots;
serverAssert(pos <= argc);
serverAssert((argc - pos) % 2 == 0);
count = (argc - pos) / 2;
slots = slotRangeArrayCreate(count);
slots->num_ranges = 0;
for (int j = pos; j < argc; j += 2) {
if ((start = getSlotOrReply(c, c->argv[j])) == -1 ||
(end = getSlotOrReply(c, c->argv[j + 1])) == -1)
{
slotRangeArrayFree(slots);
return NULL;
}
slotRangeArraySet(slots, slots->num_ranges, start, end);
slots->num_ranges++;
}
sds err = NULL;
if (validateSlotRanges(slots, &err) != C_OK) {
addReplyErrorSds(c, err);
slotRangeArrayFree(slots);
return NULL;
}
return slots;
}
/* Return 1 if the keys in the slot can be accessed, 0 otherwise. */
int clusterCanAccessKeysInSlot(int slot) {
/* If not in cluster mode, all keys are accessible */
if (server.cluster_enabled == 0) return 1;
/* If the slot is being imported under the legacy slot migration approach, we
* should still allow listing keys from the slot, as before. */
if (getImportingSlotSource(slot)) return 1;
/* If using atomic slot migration, check if the slot belongs to the current
* node or its master, return 1 if so. */
clusterNode *myself = getMyClusterNode();
if (clusterNodeIsSlave(myself)) {
clusterNode *master = clusterNodeGetMaster(myself);
if (master && clusterNodeCoversSlot(master, slot))
return 1;
} else {
if (clusterNodeCoversSlot(myself, slot))
return 1;
}
return 0;
}
/* Return the slot ranges that belong to the current node or its master. */
slotRangeArray *clusterGetLocalSlotRanges(void) {
slotRangeArray *slots = NULL;
if (!server.cluster_enabled) {
slots = slotRangeArrayCreate(1);
slotRangeArraySet(slots, 0, 0, CLUSTER_SLOTS - 1);
return slots;
}
clusterNode *master = clusterNodeGetMaster(getMyClusterNode());
if (master) {
for (int i = 0; i < CLUSTER_SLOTS; i++) {
if (clusterNodeCoversSlot(master, i))
slots = slotRangeArrayAppend(slots, i);
}
}
return slots ? slots : slotRangeArrayCreate(0);
}
/* Partially flush destination DB in a cluster node, based on the slot range.
@ -1635,77 +2006,44 @@ void sflushCommand(client *c) {
return;
}
/* Verify <first, last> slot pairs are valid and not overlapping */
long long j, first, last;
unsigned char slotsToFlushRq[CLUSTER_SLOTS] = {0};
for (j = 1; j < argc; j += 2) {
/* check if the first slot is valid */
if (getLongLongFromObject(c->argv[j], &first) != C_OK || first < 0 || first >= CLUSTER_SLOTS) {
addReplyError(c,"Invalid or out of range slot");
return;
}
/* Parse slot ranges from the command arguments. */
slotRangeArray *slots = parseSlotRangesOrReply(c, argc, 1);
if (!slots) return;
/* check if the last slot is valid */
if (getLongLongFromObject(c->argv[j+1], &last) != C_OK || last < 0 || last >= CLUSTER_SLOTS) {
addReplyError(c,"Invalid or out of range slot");
return;
}
if (first > last) {
addReplyErrorFormat(c,"start slot number %lld is greater than end slot number %lld", first, last);
return;
}
/* Mark the slots in slotsToFlushRq[] */
for (int i = first; i <= last; i++) {
if (slotsToFlushRq[i]) {
addReplyErrorFormat(c, "Slot %d specified multiple times", i);
return;
/* Iterate and find the slot ranges that belong to this node, and save them
* in a new slotRangeArray. It is allocated on the heap since the SYNC flush
* may end up running as a blocking ASYNC flush and only reply with the slot
* ranges later. */
unsigned char slots_to_flush[CLUSTER_SLOTS] = {0}; /* Requested slots to flush */
slotRangeArray *myslots = NULL;
for (int i = 0; i < slots->num_ranges; i++) {
for (int j = slots->ranges[i].start; j <= slots->ranges[i].end; j++) {
if (clusterIsMySlot(j)) {
myslots = slotRangeArrayAppend(myslots, j);
slots_to_flush[j] = 1;
}
slotsToFlushRq[i] = 1;
}
}
/* Verify slotsToFlushRq[] covers ALL slots of myNode. */
clusterNode *myNode = getMyClusterNode();
/* During iteration trace also the slot range pairs and save in SlotsFlush.
* It is allocated on heap since there is a chance that FLUSH SYNC will be
* running as blocking ASYNC and only later reply with slot ranges */
int capacity = 32; /* Initial capacity */
SlotsFlush *sflush = zmalloc(sizeof(SlotsFlush) + sizeof(SlotRange) * capacity);
sflush->numRanges = 0;
int inSlotRange = 0;
/* Verify that all slots of mynode got covered. See sflushCommand() comment. */
int all_slots_covered = 1;
for (int i = 0; i < CLUSTER_SLOTS; i++) {
if (myNode == getNodeBySlot(i)) {
if (!slotsToFlushRq[i]) {
addReplySetLen(c, 0); /* Not all slots of mynode got covered. See sflushCommand() comment. */
zfree(sflush);
return;
}
if (!inSlotRange) { /* If start another slot range */
sflush->ranges[sflush->numRanges].first = i;
inSlotRange = 1;
}
} else {
if (inSlotRange) { /* If end another slot range */
sflush->ranges[sflush->numRanges++].last = i - 1;
inSlotRange = 0;
/* If reached 'sflush' capacity, double the capacity */
if (sflush->numRanges >= capacity) {
capacity *= 2;
sflush = zrealloc(sflush, sizeof(SlotsFlush) + sizeof(SlotRange) * capacity);
}
}
if (clusterIsMySlot(i) && !slots_to_flush[i]) {
all_slots_covered = 0;
break;
}
}
if (myslots == NULL || !all_slots_covered) {
addReplyArrayLen(c, 0);
slotRangeArrayFree(slots);
slotRangeArrayFree(myslots);
return;
}
slotRangeArrayFree(slots);
/* Update last pair if last cluster slot is also end of last range */
if (inSlotRange) sflush->ranges[sflush->numRanges++].last = CLUSTER_SLOTS - 1;
/* Flush selected slots. If not flush as blocking async, then reply immediately */
if (flushCommandCommon(c, FLUSH_TYPE_SLOTS, flags, sflush) == 0)
replySlotsFlushAndFree(c, sflush);
if (flushCommandCommon(c, FLUSH_TYPE_SLOTS, flags, myslots) == 0)
replySlotsFlushAndFree(c, myslots);
}
/* The READWRITE command just clears the READONLY command state. */


@ -153,6 +153,9 @@ clusterNode *clusterLookupNode(const char *name, int length);
const char *clusterGetSecret(size_t *len);
unsigned int countKeysInSlot(unsigned int slot);
int getSlotOrReply(client *c, robj *o);
int clusterIsMySlot(int slot);
int clusterCanAccessKeysInSlot(int slot);
struct slotRangeArray *clusterGetLocalSlotRanges(void);
/* functions with shared implementations */
clusterNode *getNodeByQuery(client *c, struct redisCommand *cmd, robj **argv, int argc, int *hashslot, uint64_t cmd_flags, int *error_code);
@ -160,11 +163,44 @@ int clusterRedirectBlockedClientIfNeeded(client *c);
void clusterRedirectClient(client *c, clusterNode *n, int hashslot, int error_code);
void migrateCloseTimedoutSockets(void);
int patternHashSlot(char *pattern, int length);
int getSlotOrReply(client *c, robj *o);
int isValidAuxString(char *s, unsigned int length);
void migrateCommand(client *c);
void clusterCommand(client *c);
ConnectionType *connTypeOfCluster(void);
typedef struct slotRange {
unsigned short start, end;
} slotRange;
typedef struct slotRangeArray {
int num_ranges;
slotRange ranges[];
} slotRangeArray;
typedef struct slotRangeArrayIter {
slotRangeArray *slots; /* the array we're iterating over */
int range_index; /* current range index */
int cur_slot; /* current slot within the range */
} slotRangeArrayIter;
slotRangeArray *slotRangeArrayCreate(int num_ranges);
slotRangeArray *slotRangeArrayDup(slotRangeArray *slots);
void slotRangeArraySet(slotRangeArray *slots, int idx, int start, int end);
sds slotRangeArrayToString(slotRangeArray *slots);
slotRangeArray *slotRangeArrayFromString(sds data);
int slotRangeArrayIsEqual(slotRangeArray *slots1, slotRangeArray *slots2);
slotRangeArray *slotRangeArrayAppend(slotRangeArray *slots, int slot);
int slotRangeArrayContains(slotRangeArray *slots, unsigned int slot);
void slotRangeArrayFree(slotRangeArray *slots);
void slotRangeArrayFreeGeneric(void *slots);
slotRangeArrayIter *slotRangeArrayGetIterator(slotRangeArray *slots);
int slotRangeArrayNext(slotRangeArrayIter *it);
int slotRangeArrayGetCurrentSlot(slotRangeArrayIter *it);
void slotRangeArrayIteratorFree(slotRangeArrayIter *it);
int validateSlotRanges(slotRangeArray *slots, sds *err);
slotRangeArray *parseSlotRangesOrReply(client *c, int argc, int pos);
unsigned int clusterDelKeysInSlot(unsigned int hashslot, int by_command);
unsigned int clusterDelKeysInSlotRangeArray(slotRangeArray *slots, int by_command);
void clusterGenNodesSlotsInfo(int filter);
void clusterFreeNodesSlotsInfo(clusterNode *n);
int clusterNodeSlotInfoCount(clusterNode *n);
@ -184,4 +220,136 @@ clusterNode *clusterShardNodeFirst(void *shard);
int clusterNodeTcpPort(clusterNode *node);
int clusterNodeTlsPort(clusterNode *node);
/* API for alternative cluster implementations to start and coordinate
* Atomic Slot Migration (ASM).
*
* These two functions drive ASM for alternative cluster implementations.
* - clusterAsmProcess(...) impl -> redis: initiates/advances/cancels ASM operations
* - clusterAsmOnEvent(...) redis -> impl: notifies state changes
*
* Generic steps for an alternative implementation:
* - On destination side, implementation calls clusterAsmProcess(ASM_EVENT_IMPORT_START)
* to start an import operation.
* - Redis calls clusterAsmOnEvent() when an ASM event occurs.
* - On the source side, Redis will call clusterAsmOnEvent(ASM_EVENT_HANDOFF_PREP)
* when slots are ready to be handed off and the write pause is needed.
* - Implementation stops the traffic to the slots and calls clusterAsmProcess(ASM_EVENT_HANDOFF)
* - On the destination side, Redis calls clusterAsmOnEvent(ASM_EVENT_TAKEOVER)
* when destination node is ready to take over the slot, waiting for ownership change.
* - Cluster implementation updates the config and calls clusterAsmProcess(ASM_EVENT_DONE)
* to notify Redis that the slots ownership has changed.
*
* Sequence diagram for import (showing only the events the cluster
* implementation needs to react to):
*
*   Destination cluster impl -> Destination master       : ASM_EVENT_IMPORT_START
*   Destination master       -> Source master            : CLUSTER SYNCSLOTS <arg>
*   Source master            -> Destination master       : SNAPSHOT (restore cmds)
*   Source master            -> Destination master       : Repl stream
*   Source master            -> Source cluster impl      : ASM_EVENT_HANDOFF_PREP
*   Source cluster impl      -> Source master            : ASM_EVENT_HANDOFF
*   Source master            -> Destination master       : Drain repl stream
*   Destination master       -> Destination cluster impl : ASM_EVENT_TAKEOVER
*   Destination cluster impl -> Destination master       : ASM_EVENT_DONE
*   Source cluster impl      -> Source master            : ASM_EVENT_DONE
*
*/
#define ASM_EVENT_IMPORT_START 1 /* Start a new import operation (destination side) */
#define ASM_EVENT_CANCEL 2 /* Cancel an ongoing import/migrate operation (source and destination side) */
#define ASM_EVENT_HANDOFF_PREP 3 /* Slot is ready to be handed off to the destination shard (source side) */
#define ASM_EVENT_HANDOFF 4 /* Notify that the slot can be handed off (source side) */
#define ASM_EVENT_TAKEOVER 5 /* Ready to take over the slot, waiting for config change (destination side) */
#define ASM_EVENT_DONE 6 /* Notify that import/migrate is completed, config is updated (source and destination side) */
#define ASM_EVENT_IMPORT_PREP 7 /* Import is about to start, the implementation may reject by returning C_ERR */
#define ASM_EVENT_IMPORT_STARTED 8 /* Import started */
#define ASM_EVENT_IMPORT_FAILED 9 /* Import failed */
#define ASM_EVENT_IMPORT_COMPLETED 10 /* Import completed (config updated) */
#define ASM_EVENT_MIGRATE_PREP 11 /* Migrate is about to start, the implementation may reject by returning C_ERR */
#define ASM_EVENT_MIGRATE_STARTED 12 /* Migrate started */
#define ASM_EVENT_MIGRATE_FAILED 13 /* Migrate failed */
#define ASM_EVENT_MIGRATE_COMPLETED 14 /* Migrate completed (config updated) */
/* Called by cluster implementation to request an ASM operation. (cluster impl --> redis)
* Valid values for 'event':
* ASM_EVENT_IMPORT_START
* ASM_EVENT_CANCEL
* ASM_EVENT_HANDOFF
* ASM_EVENT_DONE
*
* For ASM_EVENT_IMPORT_START, 'task_id' should be a unique string.
* For other events (ASM_EVENT_CANCEL, ASM_EVENT_HANDOFF, ASM_EVENT_DONE),
* 'task_id' should match the ID from the corresponding import operation.
* Usage:
* char *task_id = zmalloc(CLUSTER_NAMELEN + 1);
* getRandomHexChars(task_id, CLUSTER_NAMELEN);
* task_id[CLUSTER_NAMELEN] = '\0';
*
* slotRangeArray *slots = slotRangeArrayCreate(1);
* slotRangeArraySet(slots, 0, 0, 1000);
*
* char *err = NULL;
* int ret = clusterAsmProcess(task_id, ASM_EVENT_IMPORT_START, slots, &err);
* zfree(task_id);
* slotRangeArrayFree(slots);
*
* if (ret != C_OK) {
* serverLog(LL_WARNING, "%s", err);
* return;
* }
*
* For ASM_EVENT_CANCEL, if `task_id` is NULL, all tasks will be cancelled.
* If `arg` parameter is provided, it should be a pointer to an int. It will be
* set to the number of tasks cancelled.
*
* Return value:
* - Returns C_OK on success, C_ERR on failure and 'err' will be set to the
* error message.
*
* Memory management:
* - There is no ownership transfer of 'task_id', 'err' or `slotRangeArray`.
* - `task_id` and `slotRangeArray` should be allocated and be freed by the
* caller. Redis internally will make a copy of these.
* - `err` is allocated by Redis and should NOT be freed by the caller.
**/
int clusterAsmProcess(const char *task_id, int event, void *arg, char **err);
/* Called when an ASM event occurs to notify the cluster implementation. (redis --> cluster impl)
*
* `arg` will point to a `slotRangeArray` for the following events:
* ASM_EVENT_IMPORT_PREP
* ASM_EVENT_IMPORT_STARTED
* ASM_EVENT_MIGRATE_PREP
* ASM_EVENT_MIGRATE_STARTED
* ASM_EVENT_HANDOFF_PREP
*
* Memory management:
* - Redis owns the `task_id` and `slotRangeArray`.
*
* Returns C_OK on success.
*
* If the cluster implementation returns C_ERR for ASM_EVENT_IMPORT_PREP or
* ASM_EVENT_MIGRATE_PREP, operation will not start.
**/
int clusterAsmOnEvent(const char *task_id, int event, void *arg);
#endif /* __CLUSTER_H */
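To make the contract above concrete, here is a minimal, hypothetical sketch of how an alternative cluster implementation might react to the two events that require action on its side, following the generic steps documented in the cluster.h comment above. The helpers `myImplStopWritesToSlots()` and `myImplPublishSlotOwnership()` are placeholders for implementation-specific logic and are not part of this PR; the built-in implementation's real handler appears later in this diff, in cluster_legacy.c.

```c
#include "server.h"
#include "cluster.h"

/* Hypothetical hooks standing in for implementation-specific logic. */
static void myImplStopWritesToSlots(slotRangeArray *slots) { (void) slots; }
static void myImplPublishSlotOwnership(const char *task_id) { (void) task_id; }

/* Sketch of an alternative implementation's ASM event callback. */
int clusterAsmOnEvent(const char *task_id, int event, void *arg) {
    switch (event) {
    case ASM_EVENT_HANDOFF_PREP:
        /* Source side: stop routing writes to the migrating slots, then let
         * Redis hand them off. 'arg' points to the affected slotRangeArray. */
        myImplStopWritesToSlots((slotRangeArray *) arg);
        clusterAsmProcess(task_id, ASM_EVENT_HANDOFF, NULL, NULL);
        break;
    case ASM_EVENT_TAKEOVER:
        /* Destination side: record the new slot ownership in the
         * implementation's own metadata, then acknowledge completion. */
        myImplPublishSlotOwnership(task_id);
        clusterAsmProcess(task_id, ASM_EVENT_DONE, NULL, NULL);
        break;
    case ASM_EVENT_IMPORT_PREP:
    case ASM_EVENT_MIGRATE_PREP:
        /* Returning C_ERR here would reject the operation before it starts. */
        break;
    default:
        break;
    }
    return C_OK;
}
```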

src/cluster_asm.c — new file, 3467 lines (diff suppressed because it is too large)

src/cluster_asm.h — new file, 56 lines

@ -0,0 +1,56 @@
/*
* Copyright (c) 2025-Present, Redis Ltd.
* All rights reserved.
*
* Licensed under your choice of (a) the Redis Source Available License 2.0
* (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
* GNU Affero General Public License v3 (AGPLv3).
*/
#ifndef CLUSTER_ASM_H
#define CLUSTER_ASM_H
struct asmTask;
struct slotRangeArray;
struct slotRange;
void asmInit(void);
void asmBeforeSleep(void);
void asmCron(void);
void asmSlotSnapshotAndStreamStart(struct asmTask *task);
void asmSlotSnapshotSucceed(struct asmTask *task);
void asmSlotSnapshotFailed(struct asmTask *task);
void asmCallbackOnFreeClient(client *c);
int asmMigrateInProgress(void);
int asmImportInProgress(void);
void asmFeedMigrationClient(robj **argv, int argc);
int asmDebugSetFailPoint(char * channel, char *state);
int asmDebugSetTrimMethod(const char *method, int active_trim_delay);
void asmImportIncrAppliedBytes(struct asmTask *task, size_t bytes);
struct slotRangeArray *asmTaskGetSlotRanges(const char *task_id);
int asmNotifyConfigUpdated(struct asmTask *task, sds *err);
size_t asmGetPeakSyncBufferSize(void);
size_t asmGetImportInputBufferSize(void);
size_t asmGetMigrateOutputBufferSize(void);
int clusterAsmCancel(const char *task_id, const char *reason);
int clusterAsmCancelBySlot(int slot, const char *reason);
int clusterAsmCancelBySlotRangeArray(struct slotRangeArray *slots, const char *reason);
int clusterAsmCancelByNode(void *node, const char *reason);
int isSlotInAsmTask(int slot);
int isSlotInTrimJob(int slot);
sds asmCatInfoString(sds info);
void clusterMigrationCommand(client *c);
void clusterSyncSlotsCommand(client *c);
struct asmTask *asmLookupTaskBySlotRangeArray(struct slotRangeArray *slots);
void asmCancelTrimJobs(void);
sds asmDumpActiveImportTask(void);
int asmReplicaHandleMasterTask(sds task_info);
void asmFinalizeMasterTask(void);
int asmIsTrimInProgress(void);
int asmGetTrimmingSlotForCommand(struct redisCommand *cmd, robj **argv, int argc);
void asmActiveTrimCycle(void);
int asmActiveTrimDelIfNeeded(redisDb *db, robj *key, kvobj *kv);
int asmModulePropagateBeforeSlotSnapshot(struct redisCommand *cmd, robj **argv, int argc);
#endif


@ -20,6 +20,7 @@
#include "server.h"
#include "cluster.h"
#include "cluster_legacy.h"
#include "cluster_asm.h"
#include "cluster_slot_stats.h"
#include "endianconv.h"
#include "connection.h"
@ -76,7 +77,7 @@ const char *clusterGetMessageTypeString(int type);
void removeChannelsInSlot(unsigned int slot);
unsigned int countKeysInSlot(unsigned int hashslot);
unsigned int countChannelsInSlot(unsigned int hashslot);
unsigned int delKeysInSlot(unsigned int hashslot);
unsigned int clusterDelKeysInSlot(unsigned int hashslot, int flags);
void clusterAddNodeToShard(const char *shard_id, clusterNode *node);
list *clusterLookupNodeListByShardId(const char *shard_id);
void clusterRemoveNodeFromShard(clusterNode *node);
@ -1034,6 +1035,7 @@ void clusterInit(void) {
resetClusterStats();
getRandomHexChars(server.cluster->internal_secret, CLUSTER_INTERNALSECRETLEN);
asmInit();
}
void clusterInitLast(void) {
@ -1076,6 +1078,7 @@ void clusterReset(int hard) {
/* Turn into master. */
if (nodeIsSlave(myself)) {
asmFinalizeMasterTask();
clusterSetNodeAsMaster(myself);
replicationUnsetMaster();
emptyData(-1,EMPTYDB_NO_FLAGS,NULL);
@ -1085,6 +1088,10 @@ void clusterReset(int hard) {
clusterCloseAllSlots();
resetManualFailover();
/* Cancel all ASM tasks */
clusterAsmCancel(NULL, "CLUSTER RESET");
asmCancelTrimJobs();
/* Unassign all the slots. */
for (j = 0; j < CLUSTER_SLOTS; j++) clusterDelSlot(j);
@ -1539,7 +1546,8 @@ void clusterAddNode(clusterNode *node) {
* 2) Remove all the failure reports sent by this node and referenced by
* other nodes.
* 3) Remove the node from the owning shard
* 4) Free the node with freeClusterNode() that will in turn remove it
* 4) Cancel all ASM tasks that involve the node.
* 5) Free the node with freeClusterNode() that will in turn remove it
* from the hash table and from the list of slaves of its master, if
* it is a slave node.
*/
@ -1571,7 +1579,10 @@ void clusterDelNode(clusterNode *delnode) {
/* 3) Remove the node from the owning shard */
clusterRemoveNodeFromShard(delnode);
/* 4) Free the node, unlinking it from the cluster. */
/* 4) Cancel all ASM tasks that involve the node. */
clusterAsmCancelByNode(delnode, "node deleted");
/* 5) Free the node, unlinking it from the cluster. */
freeClusterNode(delnode);
}
@ -2356,6 +2367,7 @@ void clusterUpdateSlotsConfigWith(clusterNode *sender, uint64_t senderConfigEpoc
return;
}
slotRangeArray *sra = NULL;
for (j = 0; j < CLUSTER_SLOTS; j++) {
if (bitmapTestBit(slots,j)) {
sender_slots++;
@ -2379,6 +2391,13 @@ void clusterUpdateSlotsConfigWith(clusterNode *sender, uint64_t senderConfigEpoc
if (isSlotUnclaimed(j) ||
server.cluster->slots[j]->configEpoch < senderConfigEpoch)
{
/* After a slot range migration completes, the destination node will
* broadcast a PONG message to all the nodes. We need to detect that the
* slot was moved from us to the sender, and call asmNotifyConfigUpdated()
* to notify the ASM state machine. */
if (server.cluster->slots[j] == myself && sender != myself)
sra = slotRangeArrayAppend(sra, j);
/* Was this slot mine, and still contains keys? Mark it as
* a dirty slot. */
if (server.cluster->slots[j] == myself &&
@ -2411,6 +2430,24 @@ void clusterUpdateSlotsConfigWith(clusterNode *sender, uint64_t senderConfigEpoc
}
}
/* Notify ASM about the config update */
struct asmTask *asm_task = NULL;
if (sra && sra->num_ranges > 0 && server.masterhost == NULL) {
sds err = NULL;
asm_task = asmLookupTaskBySlotRangeArray(sra);
if (!asm_task) {
/* If no task was found, the config update is not related to the current
* ASM task; this node learned about the config update from the cluster
* protocol, so we need to cancel any conflicting tasks that overlap with
* the slot ranges. */
clusterAsmCancelBySlotRangeArray(sra, "slots configuration updated");
} else if (asmNotifyConfigUpdated(asm_task, &err) != C_OK) {
serverLog(LL_WARNING, "ASM config update failed: %s", err);
sdsfree(err);
}
}
slotRangeArrayFree(sra);
/* After updating the slots configuration, don't do any actual change
* in the state of the server if a module disabled Redis Cluster
* keys redirections. */
@ -2451,7 +2488,7 @@ void clusterUpdateSlotsConfigWith(clusterNode *sender, uint64_t senderConfigEpoc
clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
CLUSTER_TODO_UPDATE_STATE|
CLUSTER_TODO_FSYNC_CONFIG);
} else if (dirty_slots_count) {
} else if (dirty_slots_count && !asm_task) {
/* If we are here, we received an update message which removed
* ownership for certain slots we still have keys about, but still
* we are serving some slots, so this master node was not demoted to
@ -2460,7 +2497,7 @@ void clusterUpdateSlotsConfigWith(clusterNode *sender, uint64_t senderConfigEpoc
* In order to maintain a consistent state between keys and slots
* we need to remove all the keys from the slots we lost. */
for (j = 0; j < dirty_slots_count; j++)
delKeysInSlot(dirty_slots[j]);
clusterDelKeysInSlot(dirty_slots[j], 0);
}
}
@ -2656,6 +2693,7 @@ void clusterProcessPingExtensions(clusterMsg *hdr, clusterLink *link) {
if (n && n != myself && !(nodeIsSlave(myself) && myself->slaveof == n)) {
sds id = sdsnewlen(forgotten_node_ext->name, CLUSTER_NAMELEN);
dictEntry *de = dictAddOrFind(server.cluster->nodes_black_list, id);
if (dictGetKey(de) != id) sdsfree(id);
uint64_t expire = server.unixtime + ntohu64(forgotten_node_ext->ttl);
dictSetUnsignedIntegerVal(de, expire);
clusterDelNode(n);
@ -3253,6 +3291,8 @@ int clusterProcessPacket(clusterLink *link) {
/* This message is acceptable only if I'm a master and the sender
* is one of my slaves. */
if (!sender || sender->slaveof != myself) return 1;
/* Cancel all ASM tasks when starting manual failover */
clusterAsmCancel(NULL, "manual failover");
/* Manual failover requested from slaves. Initialize the state
* accordingly. */
resetManualFailover();
@ -4232,6 +4272,9 @@ void clusterFailoverReplaceYourMaster(void) {
/* 5) If there was a manual failover in progress, clear the state. */
resetManualFailover();
/* 6) Handle the ASM task from previous master. */
asmFinalizeMasterTask();
}
/* This function is called if we are a slave node and our master serving
@ -4875,6 +4918,9 @@ void clusterCron(void) {
if (update_state || server.cluster->state == CLUSTER_FAIL)
clusterUpdateState();
/* Atomic slot migration cron */
asmCron();
}
/* This function is called before the event handler returns to sleep for
@ -4912,6 +4958,8 @@ void clusterBeforeSleep(void) {
int fsync = flags & CLUSTER_TODO_FSYNC_CONFIG;
clusterSaveConfigOrDie(fsync);
}
asmBeforeSleep();
}
void clusterDoBeforeSleep(int flags) {
@ -5247,8 +5295,13 @@ int verifyClusterConfigWithData(void) {
} else {
serverLog(LL_NOTICE, "I have keys for slot %d, but the slot is "
"assigned to another node. "
"Setting it to importing state.",j);
server.cluster->importing_slots_from[j] = server.cluster->slots[j];
"Deleting keys in the slot.", j);
/* With atomic slot migration, it is safe to drop keys from slots
* that are not owned. This will not result in data loss under the
* legacy slot migration approach either, since the importing state
* has already been persisted in node.conf. */
clusterDelKeysInSlot(j, 0);
}
}
if (update_config) clusterSaveConfigOrDie(1);
@ -5276,7 +5329,8 @@ void clusterSetMaster(clusterNode *n) {
serverAssert(n != myself);
serverAssert(myself->numslots == 0);
if (clusterNodeIsMaster(myself)) {
int was_master = clusterNodeIsMaster(myself);
if (was_master) {
myself->flags &= ~(CLUSTER_NODE_MASTER|CLUSTER_NODE_MIGRATE_TO);
myself->flags |= CLUSTER_NODE_SLAVE;
clusterCloseAllSlots();
@ -5290,6 +5344,9 @@ void clusterSetMaster(clusterNode *n) {
replicationSetMaster(n->ip, getNodeDefaultReplicationPort(n));
removeAllNotOwnedShardChannelSubscriptions();
resetManualFailover();
/* Cancel all ASM tasks when switching into slave */
if (was_master) clusterAsmCancel(NULL, "switching to replica");
}
/* -----------------------------------------------------------------------------
@ -5638,6 +5695,9 @@ void clusterUpdateSlots(client *c, unsigned char *slots, int del) {
if (server.cluster->importing_slots_from[j])
server.cluster->importing_slots_from[j] = NULL;
/* Cancel any ASM task that overlaps with the slot. */
clusterAsmCancelBySlot(j, "slots configuration updated");
retval = del ? clusterDelSlot(j) :
clusterAddSlot(myself,j);
serverAssertWithInfo(c,NULL,retval == C_OK);
@ -5784,6 +5844,8 @@ sds genClusterInfoString(void) {
"total_cluster_links_buffer_limit_exceeded:%llu\r\n",
server.cluster->stat_cluster_links_buffer_limit_exceeded);
info = asmCatInfoString(info);
return info;
}
@ -5794,39 +5856,6 @@ void removeChannelsInSlot(unsigned int slot) {
pubsubShardUnsubscribeAllChannelsInSlot(slot);
}
/* Remove all the keys in the specified hash slot.
* The number of removed items is returned. */
unsigned int delKeysInSlot(unsigned int hashslot) {
if (!kvstoreDictSize(server.db->keys, hashslot))
return 0;
unsigned int j = 0;
kvstoreDictIterator *kvs_di = NULL;
dictEntry *de = NULL;
kvs_di = kvstoreGetDictSafeIterator(server.db->keys, hashslot);
while((de = kvstoreDictIteratorNext(kvs_di)) != NULL) {
enterExecutionUnit(1, 0);
sds sdskey = kvobjGetKey(dictGetKV(de));
robj *key = createStringObject(sdskey, sdslen(sdskey));
dbDelete(&server.db[0], key);
propagateDeletion(&server.db[0], key, server.lazyfree_lazy_server_del);
signalModifiedKey(NULL, &server.db[0], key);
/* The keys are not actually logically deleted from the database, just moved to another node.
* The modules needs to know that these keys are no longer available locally, so just send the
* keyspace notification to the modules, but not to clients. */
moduleNotifyKeyspaceEvent(NOTIFY_GENERIC, "del", key, server.db[0].id);
exitExecutionUnit();
postExecutionUnitOperations();
decrRefCount(key);
j++;
server.dirty++;
}
kvstoreReleaseDictIterator(kvs_di);
return j;
}
/* Get the count of the channels for a given slot. */
unsigned int countChannelsInSlot(unsigned int hashslot) {
return kvstoreDictSize(server.pubsubshard_channels, hashslot);
@ -6090,6 +6119,22 @@ int clusterCommandSpecial(client *c) {
if ((slot = getSlotOrReply(c, c->argv[2])) == -1) return 1;
/* Don't allow legacy slot migration if the slot is in an ASM task. */
if (isSlotInAsmTask(slot)) {
addReplyErrorFormat(c, "Slot %d is currently in an active atomic slot migration. "
"CLUSTER SETSLOT cannot be used at this time. To perform a legacy slot migration "
"instead, first cancel the ongoing task with CLUSTER MIGRATION CANCEL", slot);
return 1;
}
if (isSlotInTrimJob(slot)) {
addReplyErrorFormat(c, "There is a pending trim job for slot %d. "
"Most probably, this is due to a failed atomic slot migration. "
"CLUSTER SETSLOT cannot be used at this time. "
"Please retry later once the trim job is completed.", slot);
return 1;
}
if (!strcasecmp(c->argv[3]->ptr,"migrating") && c->argc == 5) {
if (server.cluster->slots[slot] != myself) {
addReplyErrorFormat(c,"I'm not the owner of hash slot %u",slot);
@ -6411,6 +6456,10 @@ int clusterCommandSpecial(client *c) {
} else if (!strcasecmp(c->argv[1]->ptr,"links") && c->argc == 2) {
/* CLUSTER LINKS */
addReplyClusterLinksDescription(c);
} else if (!strcasecmp(c->argv[1]->ptr, "migration")) {
clusterMigrationCommand(c);
} else if (!strcasecmp(c->argv[1]->ptr,"syncslots") && c->argc >= 3) {
clusterSyncSlotsCommand(c);
} else {
return 0;
}
@ -6515,4 +6564,64 @@ int clusterAllowFailoverCmd(client *c) {
void clusterPromoteSelfToMaster(void) {
replicationUnsetMaster();
asmFinalizeMasterTask();
}
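/* ASM event callback of the built-in cluster implementation: it logs task state
* changes, performs the slot takeover (slot reassignment, config epoch bump,
* config save and PONG broadcast) on the destination side, and pauses/resumes
* client writes around the handoff on the source side. */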
int clusterAsmOnEvent(const char *task_id, int event, void *arg) {
UNUSED(arg);
sds str = NULL;
slotRangeArray *slots = asmTaskGetSlotRanges(task_id);
if (slots) str = slotRangeArrayToString(slots);
switch (event) {
case ASM_EVENT_IMPORT_STARTED:
serverLog(LL_NOTICE, "Import task %s started for slots: %s", task_id, str);
break;
case ASM_EVENT_IMPORT_FAILED:
serverLog(LL_NOTICE, "Import task %s failed for slots: %s", task_id, str);
break;
case ASM_EVENT_TAKEOVER:
serverLog(LL_NOTICE, "Import task %s is ready to takeover slots: %s", task_id, str);
for (int i = 0; i < slots->num_ranges; i++) {
slotRange *sr = &slots->ranges[i];
for (int j = sr->start; j <= sr->end; j++) {
clusterDelSlot(j);
clusterAddSlot(myself, j);
}
}
/* Bump the config epoch, persist the new config and broadcast it. */
clusterBumpConfigEpochWithoutConsensus();
clusterSaveConfigOrDie(1);
clusterBroadcastPong(CLUSTER_BROADCAST_ALL);
clusterAsmProcess(task_id, ASM_EVENT_DONE, NULL, NULL);
break;
case ASM_EVENT_IMPORT_COMPLETED:
serverLog(LL_NOTICE, "Import task %s completed for slots: %s", task_id, str);
break;
case ASM_EVENT_MIGRATE_STARTED:
serverLog(LL_NOTICE, "Migrate task %s started for slots: %s", task_id, str);
break;
case ASM_EVENT_MIGRATE_FAILED:
serverLog(LL_NOTICE, "Migrate task %s failed for slots: %s", task_id, str);
unpauseActions(PAUSE_DURING_SLOT_HANDOFF);
break;
case ASM_EVENT_HANDOFF_PREP:
serverLog(LL_NOTICE, "Migrate task %s preparing to handoff for slots: %s", task_id, str);
pauseActions(PAUSE_DURING_SLOT_HANDOFF,
LLONG_MAX,
PAUSE_ACTIONS_CLIENT_WRITE_SET);
clusterAsmProcess(task_id, ASM_EVENT_HANDOFF, NULL, NULL);
break;
case ASM_EVENT_MIGRATE_COMPLETED:
serverLog(LL_NOTICE, "Migrate task %s completed for slots: %s", task_id, str);
unpauseActions(PAUSE_DURING_SLOT_HANDOFF);
break;
default:
break;
}
sdsfree(str);
return C_OK;
}


@ -683,6 +683,53 @@ struct COMMAND_ARG CLUSTER_MEET_Args[] = {
{MAKE_ARG("cluster-bus-port",ARG_TYPE_INTEGER,-1,NULL,NULL,"4.0.0",CMD_ARG_OPTIONAL,0,NULL)},
};
/********** CLUSTER MIGRATION ********************/
#ifndef SKIP_CMD_HISTORY_TABLE
/* CLUSTER MIGRATION history */
#define CLUSTER_MIGRATION_History NULL
#endif
#ifndef SKIP_CMD_TIPS_TABLE
/* CLUSTER MIGRATION tips */
#define CLUSTER_MIGRATION_Tips NULL
#endif
#ifndef SKIP_CMD_KEY_SPECS_TABLE
/* CLUSTER MIGRATION key specs */
#define CLUSTER_MIGRATION_Keyspecs NULL
#endif
/* CLUSTER MIGRATION subcommand import argument table */
struct COMMAND_ARG CLUSTER_MIGRATION_subcommand_import_Subargs[] = {
{MAKE_ARG("start-slot",ARG_TYPE_INTEGER,-1,NULL,NULL,NULL,CMD_ARG_NONE,0,NULL)},
{MAKE_ARG("end-slot",ARG_TYPE_INTEGER,-1,NULL,NULL,NULL,CMD_ARG_NONE,0,NULL)},
};
/* CLUSTER MIGRATION subcommand cancel argument table */
struct COMMAND_ARG CLUSTER_MIGRATION_subcommand_cancel_Subargs[] = {
{MAKE_ARG("task-id",ARG_TYPE_STRING,-1,"ID",NULL,NULL,CMD_ARG_NONE,0,NULL)},
{MAKE_ARG("all",ARG_TYPE_PURE_TOKEN,-1,"ALL",NULL,NULL,CMD_ARG_NONE,0,NULL)},
};
/* CLUSTER MIGRATION subcommand status argument table */
struct COMMAND_ARG CLUSTER_MIGRATION_subcommand_status_Subargs[] = {
{MAKE_ARG("task-id",ARG_TYPE_STRING,-1,"ID",NULL,NULL,CMD_ARG_OPTIONAL,0,NULL)},
{MAKE_ARG("all",ARG_TYPE_PURE_TOKEN,-1,"ALL",NULL,NULL,CMD_ARG_OPTIONAL,0,NULL)},
};
/* CLUSTER MIGRATION subcommand argument table */
struct COMMAND_ARG CLUSTER_MIGRATION_subcommand_Subargs[] = {
{MAKE_ARG("import",ARG_TYPE_BLOCK,-1,"IMPORT",NULL,NULL,CMD_ARG_MULTIPLE,2,NULL),.subargs=CLUSTER_MIGRATION_subcommand_import_Subargs},
{MAKE_ARG("cancel",ARG_TYPE_ONEOF,-1,"CANCEL",NULL,NULL,CMD_ARG_NONE,2,NULL),.subargs=CLUSTER_MIGRATION_subcommand_cancel_Subargs},
{MAKE_ARG("status",ARG_TYPE_BLOCK,-1,"STATUS",NULL,NULL,CMD_ARG_NONE,2,NULL),.subargs=CLUSTER_MIGRATION_subcommand_status_Subargs},
};
/* CLUSTER MIGRATION argument table */
struct COMMAND_ARG CLUSTER_MIGRATION_Args[] = {
{MAKE_ARG("subcommand",ARG_TYPE_ONEOF,-1,NULL,NULL,NULL,CMD_ARG_NONE,3,NULL),.subargs=CLUSTER_MIGRATION_subcommand_Subargs},
};
/********** CLUSTER MYID ********************/
#ifndef SKIP_CMD_HISTORY_TABLE
@ -997,6 +1044,65 @@ const char *CLUSTER_SLOTS_Tips[] = {
#define CLUSTER_SLOTS_Keyspecs NULL
#endif
/********** CLUSTER SYNCSLOTS ********************/
#ifndef SKIP_CMD_HISTORY_TABLE
/* CLUSTER SYNCSLOTS history */
#define CLUSTER_SYNCSLOTS_History NULL
#endif
#ifndef SKIP_CMD_TIPS_TABLE
/* CLUSTER SYNCSLOTS tips */
const char *CLUSTER_SYNCSLOTS_Tips[] = {
"nondeterministic_output",
};
#endif
#ifndef SKIP_CMD_KEY_SPECS_TABLE
/* CLUSTER SYNCSLOTS key specs */
#define CLUSTER_SYNCSLOTS_Keyspecs NULL
#endif
/* CLUSTER SYNCSLOTS subcommand sync slot_range argument table */
struct COMMAND_ARG CLUSTER_SYNCSLOTS_subcommand_sync_slot_range_Subargs[] = {
{MAKE_ARG("start-slot",ARG_TYPE_INTEGER,-1,NULL,NULL,NULL,CMD_ARG_NONE,0,NULL)},
{MAKE_ARG("end-slot",ARG_TYPE_INTEGER,-1,NULL,NULL,NULL,CMD_ARG_NONE,0,NULL)},
};
/* CLUSTER SYNCSLOTS subcommand sync argument table */
struct COMMAND_ARG CLUSTER_SYNCSLOTS_subcommand_sync_Subargs[] = {
{MAKE_ARG("task-id",ARG_TYPE_STRING,-1,NULL,NULL,NULL,CMD_ARG_NONE,0,NULL)},
{MAKE_ARG("slot-range",ARG_TYPE_BLOCK,-1,NULL,NULL,NULL,CMD_ARG_MULTIPLE,2,NULL),.subargs=CLUSTER_SYNCSLOTS_subcommand_sync_slot_range_Subargs},
};
/* CLUSTER SYNCSLOTS subcommand ack argument table */
struct COMMAND_ARG CLUSTER_SYNCSLOTS_subcommand_ack_Subargs[] = {
{MAKE_ARG("state",ARG_TYPE_STRING,-1,NULL,NULL,NULL,CMD_ARG_NONE,0,NULL)},
{MAKE_ARG("offset",ARG_TYPE_INTEGER,-1,NULL,NULL,NULL,CMD_ARG_NONE,0,NULL)},
};
/* CLUSTER SYNCSLOTS subcommand conf argument table */
struct COMMAND_ARG CLUSTER_SYNCSLOTS_subcommand_conf_Subargs[] = {
{MAKE_ARG("option",ARG_TYPE_STRING,-1,NULL,NULL,NULL,CMD_ARG_MULTIPLE,0,NULL)},
{MAKE_ARG("value",ARG_TYPE_STRING,-1,NULL,NULL,NULL,CMD_ARG_MULTIPLE,0,NULL)},
};
/* CLUSTER SYNCSLOTS subcommand argument table */
struct COMMAND_ARG CLUSTER_SYNCSLOTS_subcommand_Subargs[] = {
{MAKE_ARG("sync",ARG_TYPE_BLOCK,-1,"SYNC",NULL,NULL,CMD_ARG_NONE,2,NULL),.subargs=CLUSTER_SYNCSLOTS_subcommand_sync_Subargs},
{MAKE_ARG("task-id",ARG_TYPE_STRING,-1,"RDBCHANNEL",NULL,NULL,CMD_ARG_NONE,0,NULL)},
{MAKE_ARG("snapshot-eof",ARG_TYPE_PURE_TOKEN,-1,"SNAPSHOT-EOF",NULL,NULL,CMD_ARG_NONE,0,NULL)},
{MAKE_ARG("stream-eof",ARG_TYPE_PURE_TOKEN,-1,"STREAM-EOF",NULL,NULL,CMD_ARG_NONE,0,NULL)},
{MAKE_ARG("ack",ARG_TYPE_BLOCK,-1,"ACK",NULL,NULL,CMD_ARG_NONE,2,NULL),.subargs=CLUSTER_SYNCSLOTS_subcommand_ack_Subargs},
{MAKE_ARG("error",ARG_TYPE_STRING,-1,"FAIL",NULL,NULL,CMD_ARG_NONE,0,NULL)},
{MAKE_ARG("conf",ARG_TYPE_BLOCK,-1,"CONF",NULL,NULL,CMD_ARG_NONE,2,NULL),.subargs=CLUSTER_SYNCSLOTS_subcommand_conf_Subargs},
};
/* CLUSTER SYNCSLOTS argument table */
struct COMMAND_ARG CLUSTER_SYNCSLOTS_Args[] = {
{MAKE_ARG("subcommand",ARG_TYPE_ONEOF,-1,NULL,NULL,NULL,CMD_ARG_NONE,7,NULL),.subargs=CLUSTER_SYNCSLOTS_subcommand_Subargs},
};
/* CLUSTER command table */
struct COMMAND_STRUCT CLUSTER_Subcommands[] = {
{MAKE_CMD("addslots","Assigns new hash slots to a node.","O(N) where N is the total number of hash slot arguments","3.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_ADDSLOTS_History,0,CLUSTER_ADDSLOTS_Tips,0,clusterCommand,-3,CMD_NO_ASYNC_LOADING|CMD_ADMIN|CMD_STALE,0,CLUSTER_ADDSLOTS_Keyspecs,0,NULL,1),.args=CLUSTER_ADDSLOTS_Args},
@ -1015,6 +1121,7 @@ struct COMMAND_STRUCT CLUSTER_Subcommands[] = {
{MAKE_CMD("keyslot","Returns the hash slot for a key.","O(N) where N is the number of bytes in the key","3.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_KEYSLOT_History,0,CLUSTER_KEYSLOT_Tips,0,clusterCommand,3,CMD_STALE,0,CLUSTER_KEYSLOT_Keyspecs,0,NULL,1),.args=CLUSTER_KEYSLOT_Args},
{MAKE_CMD("links","Returns a list of all TCP links to and from peer nodes.","O(N) where N is the total number of Cluster nodes","7.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_LINKS_History,0,CLUSTER_LINKS_Tips,1,clusterCommand,2,CMD_STALE,0,CLUSTER_LINKS_Keyspecs,0,NULL,0)},
{MAKE_CMD("meet","Forces a node to handshake with another node.","O(1)","3.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_MEET_History,1,CLUSTER_MEET_Tips,0,clusterCommand,-4,CMD_NO_ASYNC_LOADING|CMD_ADMIN|CMD_STALE,0,CLUSTER_MEET_Keyspecs,0,NULL,3),.args=CLUSTER_MEET_Args},
{MAKE_CMD("migration","Start, monitor and cancel slot migration.","O(N) where N is the total number of the slots between the start slot and end slot arguments.","8.4.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_MIGRATION_History,0,CLUSTER_MIGRATION_Tips,0,clusterCommand,-4,CMD_NO_ASYNC_LOADING|CMD_ADMIN|CMD_STALE,0,CLUSTER_MIGRATION_Keyspecs,0,NULL,1),.args=CLUSTER_MIGRATION_Args},
{MAKE_CMD("myid","Returns the ID of a node.","O(1)","3.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_MYID_History,0,CLUSTER_MYID_Tips,0,clusterCommand,2,CMD_STALE,0,CLUSTER_MYID_Keyspecs,0,NULL,0)},
{MAKE_CMD("myshardid","Returns the shard ID of a node.","O(1)","7.2.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_MYSHARDID_History,0,CLUSTER_MYSHARDID_Tips,1,clusterCommand,2,CMD_STALE,0,CLUSTER_MYSHARDID_Keyspecs,0,NULL,0)},
{MAKE_CMD("nodes","Returns the cluster configuration for a node.","O(N) where N is the total number of Cluster nodes","3.0.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_NODES_History,0,CLUSTER_NODES_Tips,1,clusterCommand,2,CMD_STALE,0,CLUSTER_NODES_Keyspecs,0,NULL,0)},
@ -1028,6 +1135,7 @@ struct COMMAND_STRUCT CLUSTER_Subcommands[] = {
{MAKE_CMD("slaves","Lists the replica nodes of a master node.","O(N) where N is the number of replicas.","3.0.0",CMD_DOC_DEPRECATED,"`CLUSTER REPLICAS`","5.0.0","cluster",COMMAND_GROUP_CLUSTER,CLUSTER_SLAVES_History,0,CLUSTER_SLAVES_Tips,1,clusterCommand,3,CMD_ADMIN|CMD_STALE,0,CLUSTER_SLAVES_Keyspecs,0,NULL,1),.args=CLUSTER_SLAVES_Args},
{MAKE_CMD("slot-stats","Return an array of slot usage statistics for slots assigned to the current node.","O(N) where N is the total number of slots based on arguments. O(N*log(N)) with ORDERBY subcommand.","8.2.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_SLOT_STATS_History,0,CLUSTER_SLOT_STATS_Tips,2,clusterSlotStatsCommand,-4,CMD_STALE|CMD_LOADING,0,CLUSTER_SLOT_STATS_Keyspecs,0,NULL,1),.args=CLUSTER_SLOT_STATS_Args},
{MAKE_CMD("slots","Returns the mapping of cluster slots to nodes.","O(N) where N is the total number of Cluster nodes","3.0.0",CMD_DOC_DEPRECATED,"`CLUSTER SHARDS`","7.0.0","cluster",COMMAND_GROUP_CLUSTER,CLUSTER_SLOTS_History,2,CLUSTER_SLOTS_Tips,1,clusterCommand,2,CMD_LOADING|CMD_STALE,0,CLUSTER_SLOTS_Keyspecs,0,NULL,0)},
{MAKE_CMD("syncslots","Internal command for atomic slot migration protocol between cluster nodes.","O(1)","8.4.0",CMD_DOC_NONE,NULL,NULL,"cluster",COMMAND_GROUP_CLUSTER,CLUSTER_SYNCSLOTS_History,0,CLUSTER_SYNCSLOTS_Tips,1,clusterCommand,-3,CMD_NO_ASYNC_LOADING|CMD_ADMIN|CMD_STALE,0,CLUSTER_SYNCSLOTS_Keyspecs,0,NULL,1),.args=CLUSTER_SYNCSLOTS_Args},
{0}
};
@ -8162,6 +8270,40 @@ const char *TIME_Tips[] = {
#define TIME_Keyspecs NULL
#endif
/********** TRIMSLOTS ********************/
#ifndef SKIP_CMD_HISTORY_TABLE
/* TRIMSLOTS history */
#define TRIMSLOTS_History NULL
#endif
#ifndef SKIP_CMD_TIPS_TABLE
/* TRIMSLOTS tips */
#define TRIMSLOTS_Tips NULL
#endif
#ifndef SKIP_CMD_KEY_SPECS_TABLE
/* TRIMSLOTS key specs */
#define TRIMSLOTS_Keyspecs NULL
#endif
/* TRIMSLOTS ranges slots argument table */
struct COMMAND_ARG TRIMSLOTS_ranges_slots_Subargs[] = {
{MAKE_ARG("startslot",ARG_TYPE_INTEGER,-1,NULL,NULL,NULL,CMD_ARG_NONE,0,NULL)},
{MAKE_ARG("endslot",ARG_TYPE_INTEGER,-1,NULL,NULL,NULL,CMD_ARG_NONE,0,NULL)},
};
/* TRIMSLOTS ranges argument table */
struct COMMAND_ARG TRIMSLOTS_ranges_Subargs[] = {
{MAKE_ARG("numranges",ARG_TYPE_INTEGER,-1,NULL,NULL,NULL,CMD_ARG_NONE,0,NULL)},
{MAKE_ARG("slots",ARG_TYPE_BLOCK,-1,NULL,NULL,NULL,CMD_ARG_MULTIPLE,2,NULL),.subargs=TRIMSLOTS_ranges_slots_Subargs},
};
/* TRIMSLOTS argument table */
struct COMMAND_ARG TRIMSLOTS_Args[] = {
{MAKE_ARG("ranges",ARG_TYPE_BLOCK,-1,"RANGES",NULL,NULL,CMD_ARG_NONE,2,NULL),.subargs=TRIMSLOTS_ranges_Subargs},
};
/********** SADD ********************/
#ifndef SKIP_CMD_HISTORY_TABLE
@ -11483,6 +11625,7 @@ struct COMMAND_STRUCT redisCommandTable[] = {
{MAKE_CMD("swapdb","Swaps two Redis databases.","O(N) where N is the count of clients watching or blocking on keys from both databases.","4.0.0",CMD_DOC_NONE,NULL,NULL,"server",COMMAND_GROUP_SERVER,SWAPDB_History,0,SWAPDB_Tips,0,swapdbCommand,3,CMD_WRITE|CMD_FAST,ACL_CATEGORY_KEYSPACE|ACL_CATEGORY_DANGEROUS,SWAPDB_Keyspecs,0,NULL,2),.args=SWAPDB_Args},
{MAKE_CMD("sync","An internal command used in replication.",NULL,"1.0.0",CMD_DOC_NONE,NULL,NULL,"server",COMMAND_GROUP_SERVER,SYNC_History,0,SYNC_Tips,0,syncCommand,1,CMD_NO_ASYNC_LOADING|CMD_ADMIN|CMD_NO_MULTI|CMD_NOSCRIPT,0,SYNC_Keyspecs,0,NULL,0)},
{MAKE_CMD("time","Returns the server time.","O(1)","2.6.0",CMD_DOC_NONE,NULL,NULL,"server",COMMAND_GROUP_SERVER,TIME_History,0,TIME_Tips,1,timeCommand,1,CMD_LOADING|CMD_STALE|CMD_FAST,0,TIME_Keyspecs,0,NULL,0)},
{MAKE_CMD("trimslots","Trim the keys that belong to specified slots.","O(N) where N is the total number of keys in all databases","8.4.0",CMD_DOC_NONE,NULL,NULL,"server",COMMAND_GROUP_SERVER,TRIMSLOTS_History,0,TRIMSLOTS_Tips,0,trimslotsCommand,-5,CMD_WRITE,ACL_CATEGORY_KEYSPACE|ACL_CATEGORY_DANGEROUS,TRIMSLOTS_Keyspecs,0,NULL,1),.args=TRIMSLOTS_Args},
/* set */
{MAKE_CMD("sadd","Adds one or more members to a set. Creates the key if it doesn't exist.","O(1) for each element added, so O(N) to add N elements when the command is called with multiple arguments.","1.0.0",CMD_DOC_NONE,NULL,NULL,"set",COMMAND_GROUP_SET,SADD_History,1,SADD_Tips,0,saddCommand,-3,CMD_WRITE|CMD_DENYOOM|CMD_FAST,ACL_CATEGORY_SET,SADD_Keyspecs,1,NULL,2),.args=SADD_Args},
{MAKE_CMD("scard","Returns the number of members in a set.","O(1)","1.0.0",CMD_DOC_NONE,NULL,NULL,"set",COMMAND_GROUP_SET,SCARD_History,0,SCARD_Tips,0,scardCommand,2,CMD_READONLY|CMD_FAST,ACL_CATEGORY_SET,SCARD_Keyspecs,1,NULL,1),.args=SCARD_Args},


@ -0,0 +1,141 @@
{
"MIGRATION": {
"summary": "Start, monitor and cancel slot migration.",
"complexity": "O(N) where N is the total number of the slots between the start slot and end slot arguments.",
"group": "cluster",
"since": "8.4.0",
"arity": -4,
"container": "CLUSTER",
"function": "clusterCommand",
"command_flags": [
"NO_ASYNC_LOADING",
"ADMIN",
"STALE"
],
"arguments": [
{
"name": "subcommand",
"type": "oneof",
"arguments": [
{
"name": "import",
"token": "IMPORT",
"type": "block",
"multiple": true,
"arguments": [
{
"name": "start-slot",
"type": "integer"
},
{
"name": "end-slot",
"type": "integer"
}
]
},
{
"name": "cancel",
"token": "CANCEL",
"type": "oneof",
"arguments": [
{
"token": "ID",
"name": "task-id",
"type": "string"
},
{
"name": "all",
"token": "ALL",
"type": "pure-token"
}
]
},
{
"name": "status",
"token": "STATUS",
"type": "block",
"arguments": [
{
"token": "ID",
"name": "task-id",
"type": "string",
"optional": true
},
{
"name": "all",
"token": "ALL",
"type": "pure-token",
"optional": true
}
]
}
]
}
],
"reply_schema": {
"oneOf": [
{
"description": "Reply to CLUSTER MIGRATION IMPORT, returns the task ID.",
"type": "string"
},
{
"description": "Reply to CLUSTER MIGRATION CANCEL, number of cancelled migration operations.",
"type": "integer"
},
{
"description": "Reply to CLUSTER MIGRATION STATUS, array of migration operation details.",
"type": "array",
"items": {
"type": "object",
"additionalProperties": false,
"properties": {
"id": {
"type": "string"
},
"slots": {
"type": "string"
},
"source": {
"type": "string"
},
"dest": {
"type": "string"
},
"operation": {
"oneOf": [
{
"const": "import"
},
{
"const": "migrate"
}
]
},
"state": {
"type": "string"
},
"last_error": {
"type": "string"
},
"retries": {
"type": "integer"
},
"create_time": {
"type": "integer"
},
"start_time": {
"type": "integer"
},
"end_time": {
"type": "integer"
},
"write_pause_ms": {
"type": "integer"
}
}
}
}
]
}
}
}


@ -0,0 +1,117 @@
{
"SYNCSLOTS": {
"summary": "Internal command for atomic slot migration protocol between cluster nodes.",
"complexity": "O(1)",
"group": "cluster",
"since": "8.4.0",
"arity": -3,
"container": "CLUSTER",
"function": "clusterCommand",
"command_flags": [
"NO_ASYNC_LOADING",
"ADMIN",
"STALE"
],
"command_tips": [
"NONDETERMINISTIC_OUTPUT"
],
"arguments": [
{
"name": "subcommand",
"type": "oneof",
"arguments": [
{
"name": "sync",
"token": "SYNC",
"type": "block",
"arguments": [
{
"name": "task-id",
"type": "string"
},
{
"name": "slot-range",
"type": "block",
"multiple": true,
"arguments": [
{
"name": "start-slot",
"type": "integer"
},
{
"name": "end-slot",
"type": "integer"
}
]
}
]
},
{
"token": "RDBCHANNEL",
"name": "task-id",
"type": "string"
},
{
"name": "snapshot-eof",
"token": "SNAPSHOT-EOF",
"type": "pure-token"
},
{
"name": "stream-eof",
"token": "STREAM-EOF",
"type": "pure-token"
},
{
"name": "ack",
"token": "ACK",
"type": "block",
"arguments": [
{
"name": "state",
"type": "string"
},
{
"name": "offset",
"type": "integer"
}
]
},
{
"token": "FAIL",
"name": "error",
"type": "string"
},
{
"name": "conf",
"token": "CONF",
"type": "block",
"arguments": [
{
"name": "option",
"type": "string",
"multiple": true
},
{
"name": "value",
"type": "string",
"multiple": true
}
]
}
]
}
],
"reply_schema": {
"oneOf": [
{
"description": "Reply to CLUSTER SYNCSLOTS SYNC, returns special RDB channel sync response.",
"const": "RDBCHANNELSYNCSLOTS"
},
{
"description": "Reply to CLUSTER SYNCSLOTS CONF and other subcommands.",
"const": "OK"
}
]
}
}
}


@ -0,0 +1,48 @@
{
"TRIMSLOTS": {
"summary": "Trim the keys that belong to specified slots.",
"complexity": "O(N) where N is the total number of keys in all databases",
"group": "server",
"since": "8.4.0",
"arity": -5,
"function": "trimslotsCommand",
"command_flags": [
"WRITE"
],
"acl_categories": [
"KEYSPACE",
"DANGEROUS"
],
"reply_schema": {
"const": "OK"
},
"arguments": [
{
"name": "ranges",
"token": "RANGES",
"type": "block",
"arguments": [
{
"name": "numranges",
"type": "integer"
},
{
"name": "slots",
"type": "block",
"multiple": true,
"arguments": [
{
"name": "startslot",
"type": "integer"
},
{
"name": "endslot",
"type": "integer"
}
]
}
]
}
]
}
}


@ -3217,6 +3217,7 @@ standardConfig static_configs[] = {
createIntConfig("shutdown-timeout", NULL, MODIFIABLE_CONFIG, 0, INT_MAX, server.shutdown_timeout, 10, INTEGER_CONFIG, NULL, NULL),
createIntConfig("repl-diskless-sync-max-replicas", NULL, MODIFIABLE_CONFIG, 0, INT_MAX, server.repl_diskless_sync_max_replicas, 0, INTEGER_CONFIG, NULL, NULL),
createIntConfig("cluster-compatibility-sample-ratio", NULL, MODIFIABLE_CONFIG, 0, 100, server.cluster_compatibility_sample_ratio, 0, INTEGER_CONFIG, NULL, NULL),
createIntConfig("cluster-slot-migration-max-archived-tasks", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 1, INT_MAX, server.asm_max_archived_tasks, 32, INTEGER_CONFIG, NULL, NULL),
/* Unsigned int configs */
createUIntConfig("maxclients", NULL, MODIFIABLE_CONFIG, 1, UINT_MAX, server.maxclients, 10000, INTEGER_CONFIG, NULL, updateMaxclients),
@ -3237,6 +3238,9 @@ standardConfig static_configs[] = {
createLongLongConfig("busy-reply-threshold", "lua-time-limit", MODIFIABLE_CONFIG, 0, LONG_MAX, server.busy_reply_threshold, 5000, INTEGER_CONFIG, NULL, NULL),/* milliseconds */
createLongLongConfig("cluster-node-timeout", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.cluster_node_timeout, 15000, INTEGER_CONFIG, NULL, NULL),
createLongLongConfig("cluster-ping-interval", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 0, LLONG_MAX, server.cluster_ping_interval, 0, INTEGER_CONFIG, NULL, NULL),
createLongLongConfig("cluster-slot-migration-handoff-max-lag-bytes", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.asm_handoff_max_lag_bytes, 1*1024*1024, MEMORY_CONFIG, NULL, NULL), /* 1MB */
createLongLongConfig("cluster-slot-migration-write-pause-timeout", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.asm_write_pause_timeout, 10*1000, INTEGER_CONFIG, NULL, NULL), /* 10 seconds */
createLongLongConfig("cluster-slot-migration-sync-buffer-drain-timeout", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 0, LLONG_MAX, server.asm_sync_buffer_drain_timeout, 60000, INTEGER_CONFIG, NULL, NULL), /* 60 seconds */
createLongLongConfig("slowlog-log-slower-than", NULL, MODIFIABLE_CONFIG, -1, LLONG_MAX, server.slowlog_log_slower_than, 10000, INTEGER_CONFIG, NULL, NULL),
createLongLongConfig("latency-monitor-threshold", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.latency_monitor_threshold, 0, INTEGER_CONFIG, NULL, NULL),
createLongLongConfig("proto-max-bulk-len", NULL, DEBUG_CONFIG | MODIFIABLE_CONFIG, 1024*1024, LONG_MAX, server.proto_max_bulk_len, 512ll*1024*1024, MEMORY_CONFIG, NULL, NULL), /* Bulk request max size */

155
src/db.c
View file

@ -18,6 +18,7 @@
#include "latency.h"
#include "script.h"
#include "functions.h"
#include "cluster_asm.h"
#include "redisassert.h"
#include <signal.h>
@ -395,8 +396,8 @@ int getKeySlot(sds key) {
return slot;
}
/* Return the slot of the key in the command. IO threads use this function
* to calculate slot to reduce main-thread load */
/* Return the slot of the key in the command.
* GETSLOT_NOKEYS if no keys, GETSLOT_CROSSSLOT if cross slot, otherwise the slot number. */
int getSlotFromCommand(struct redisCommand *cmd, robj **argv, int argc) {
int slot = -1;
if (!cmd || !server.cluster_enabled) return slot;
@ -404,10 +405,18 @@ int getSlotFromCommand(struct redisCommand *cmd, robj **argv, int argc) {
/* Get the keys from the command */
getKeysResult result = GETKEYS_RESULT_INIT;
int numkeys = getKeysFromCommand(cmd, argv, argc, &result);
if (numkeys > 0) {
/* Get the slot of the first key */
robj *first = argv[result.keys[0].pos];
slot = keyHashSlot(first->ptr, (int)sdslen(first->ptr));
keyReference *keyindex = result.keys;
/* Get slot of each key and check if they are all the same */
for (int j = 0; j < numkeys; j++) {
robj *thiskey = argv[keyindex[j].pos];
int thisslot = keyHashSlot((char*)thiskey->ptr, sdslen(thiskey->ptr));
if (slot == GETSLOT_NOKEYS) {
slot = thisslot;
} else if (slot != thisslot) {
slot = GETSLOT_CROSSSLOT; /* Mark as cross slot */
break;
}
}
getKeysFreeResult(&result);
return slot;
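
To make the new contract concrete, a hedged illustration (the sentinel names come from this patch; the example commands and keys are mine):

```c
/* Illustrative outcomes of getSlotFromCommand():
 *   GET {user:1}:name            -> hash slot of "{user:1}:name"
 *   MGET {user:1}:a {user:1}:b   -> the same slot (keys share a hash tag)
 *   MGET foo bar                 -> GETSLOT_CROSSSLOT (keys hash to different slots)
 *   PING                         -> GETSLOT_NOKEYS (command has no keys)
 */
```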
@ -646,6 +655,18 @@ void setKeyByLink(client *c, redisDb *db, robj *key, robj **valref, int flags, d
signalModifiedKey(c,db,key);
}
/* During atomic slot migration, keys that are being imported are in an
* intermediate state. We cannot access them and therefore skip them.
*
* This callback function now is used by:
* - dbRandomKey
* - keysCommand
* - scanCommand
*/
static int accessKeysShouldSkipDictIndex(int didx) {
return !clusterCanAccessKeysInSlot(didx);
}
/* Return a random key, in form of a Redis object.
* If there are no keys, NULL is returned.
*
@ -657,7 +678,8 @@ robj *dbRandomKey(redisDb *db) {
while(1) {
robj *keyobj;
int randomSlot = kvstoreGetFairRandomDictIndex(db->keys);
int randomSlot = kvstoreGetFairRandomDictIndex(db->keys, accessKeysShouldSkipDictIndex, 16, 1);
if (randomSlot == -1) return NULL;
de = kvstoreDictGetFairRandomKey(db->keys, randomSlot);
if (de == NULL) return NULL;
@ -871,6 +893,9 @@ long long emptyData(int dbnum, int flags, void(callback)(dict*)) {
return -1;
}
if (dbnum == -1 || dbnum == 0)
asmCancelTrimJobs();
/* Fire the flushdb modules event. */
moduleFireServerEvent(REDISMODULE_EVENT_FLUSHDB,
REDISMODULE_SUBEVENT_FLUSHDB_START,
@ -879,7 +904,7 @@ long long emptyData(int dbnum, int flags, void(callback)(dict*)) {
/* Make sure the WATCHed keys are affected by the FLUSH* commands.
* Note that we need to call the function while the keys are still
* there. */
signalFlushedDb(dbnum, async);
signalFlushedDb(dbnum, async, NULL);
/* Empty redis database structure. */
removed = emptyDbStructure(server.db, dbnum, async, callback);
@ -969,7 +994,7 @@ void signalModifiedKey(client *c, redisDb *db, robj *key) {
trackingInvalidateKey(c,key,1);
}
void signalFlushedDb(int dbid, int async) {
void signalFlushedDb(int dbid, int async, slotRangeArray *slots) {
int startdb, enddb;
if (dbid == -1) {
startdb = 0;
@ -979,8 +1004,8 @@ void signalFlushedDb(int dbid, int async) {
}
for (int j = startdb; j <= enddb; j++) {
scanDatabaseForDeletedKeys(&server.db[j], NULL);
touchAllWatchedKeysInDb(&server.db[j], NULL);
scanDatabaseForDeletedKeys(&server.db[j], NULL, slots);
touchAllWatchedKeysInDb(&server.db[j], NULL, slots);
}
trackingInvalidateKeysOnFlush(async);
@ -1046,13 +1071,13 @@ void flushAllDataAndResetRDB(int flags) {
*
* Utilized by commands SFLUSH, FLUSHALL and FLUSHDB.
*/
void flushallSyncBgDone(uint64_t client_id, void *sflush) {
SlotsFlush *slotsFlush = sflush;
void flushallSyncBgDone(uint64_t client_id, void *userdata) {
slotRangeArray *slots = userdata;
client *c = lookupClientByID(client_id);
/* Verify that client still exists and being blocked. */
if (!(c && c->flags & CLIENT_BLOCKED)) {
zfree(sflush);
slotRangeArrayFree(slots);
return;
}
@ -1063,9 +1088,9 @@ void flushallSyncBgDone(uint64_t client_id, void *sflush) {
/* Don't update blocked_us since command was processed in bg by lazy_free thread */
updateStatsOnUnblock(c, 0 /*blocked_us*/, elapsedUs(c->bstate.lazyfreeStartTime), 0);
/* Only SFLUSH command pass pointer to `SlotsFlush` */
if (slotsFlush)
replySlotsFlushAndFree(c, slotsFlush);
/* Only SFLUSH command pass user data pointer. */
if (slots)
replySlotsFlushAndFree(c, slots);
else
addReply(c, shared.ok);
@ -1087,16 +1112,16 @@ void flushallSyncBgDone(uint64_t client_id, void *sflush) {
server.current_client = old_client;
}
/* Common flush command implementation for FLUSHALL and FLUSHDB.
/* Common flush command implementation for FLUSHALL, FLUSHDB and SFLUSH.
*
* Return 1 indicates that flush SYNC is actually running in bg as blocking ASYNC
* Return 0 otherwise
*
* sflush - provided only by SFLUSH command, otherwise NULL. Will be used on
* completion to reply with the slots flush result. Ownership is passed
* to the completion job in case of `blocking_async`.
* slots - provided only by SFLUSH command, otherwise NULL. Will be used on
* completion to reply with the slots flush result. Ownership is passed
* to the completion job in case of `blocking_async`.
*/
int flushCommandCommon(client *c, int type, int flags, SlotsFlush *sflush) {
int flushCommandCommon(client *c, int type, int flags, slotRangeArray *slots) {
int blocking_async = 0; /* Flush SYNC option to run as blocking ASYNC */
/* in case of SYNC, check if we can optimize and run it in bg as blocking ASYNC */
@ -1106,6 +1131,9 @@ int flushCommandCommon(client *c, int type, int flags, SlotsFlush *sflush) {
blocking_async = 1;
}
/* Cancel all ASM tasks that overlap with the given slot ranges. */
clusterAsmCancelBySlotRangeArray(slots, c->argv[0]->ptr);
if (type == FLUSH_TYPE_ALL)
flushAllDataAndResetRDB(flags | EMPTYDB_NOFUNCTIONS);
else
@ -1128,7 +1156,7 @@ int flushCommandCommon(client *c, int type, int flags, SlotsFlush *sflush) {
* avoid command from being reset during unblock. */
c->flags |= CLIENT_PENDING_COMMAND;
blockClient(c,BLOCKED_LAZYFREE);
bioCreateCompRq(BIO_WORKER_LAZY_FREE, flushallSyncBgDone, c->id, sflush);
bioCreateCompRq(BIO_WORKER_LAZY_FREE, flushallSyncBgDone, c->id, slots);
}
#if defined(USE_JEMALLOC)
@ -1349,7 +1377,7 @@ void keysCommand(client *c) {
kvstoreDictIterator *kvs_di = NULL;
kvstoreIterator *kvs_it = NULL;
if (pslot != -1) {
if (!kvstoreDictSize(c->db->keys, pslot)) {
if (!kvstoreDictSize(c->db->keys, pslot) || accessKeysShouldSkipDictIndex(pslot)) {
/* Requested slot is empty */
setDeferredArrayLen(c,replylen,0);
return;
@ -1360,6 +1388,10 @@ void keysCommand(client *c) {
}
while ((de = kvs_di ? kvstoreDictIteratorNext(kvs_di) : kvstoreIteratorNext(kvs_it)) != NULL) {
if (kvs_it && accessKeysShouldSkipDictIndex(kvstoreIteratorGetCurrentDictIndex(kvs_it))) {
continue;
}
kvobj *kv = dictGetKV(de);
sds key = kvobjGetKey(kv);
@ -1529,6 +1561,11 @@ char *getObjectTypeName(robj *o) {
}
}
static int scanShouldSkipDict(dict *d, int didx) {
UNUSED(d);
return accessKeysShouldSkipDictIndex(didx);
}
/* This command implements SCAN, HSCAN and SSCAN commands.
* If object 'o' is passed, then it must be a Hash, Set or Zset object, otherwise
* if 'o' is NULL the command will operate on the dictionary associated with
@ -1684,7 +1721,7 @@ void scanGenericCommand(client *c, robj *o, unsigned long long cursor) {
/* In cluster mode there is a separate dictionary for each slot.
* If cursor is empty, we should try exploring next non-empty slot. */
if (o == NULL) {
cursor = kvstoreScan(c->db->keys, cursor, onlydidx, scanCallback, NULL, &data);
cursor = kvstoreScan(c->db->keys, cursor, onlydidx, scanCallback, scanShouldSkipDict, &data);
} else {
cursor = dictScan(ht, cursor, scanCallback, &data);
}
@ -1856,7 +1893,7 @@ void scanCommand(client *c) {
}
void dbsizeCommand(client *c) {
addReplyLongLong(c,kvstoreSize(c->db->keys));
addReplyLongLong(c,dbSize(c->db));
}
void lastsaveCommand(client *c) {
@ -2222,16 +2259,20 @@ void scanDatabaseForReadyKeys(redisDb *db) {
dictResetIterator(&di);
}
/* Since we are unblocking XREADGROUP clients in the event the
* key was deleted/overwritten we must do the same in case the
* database was flushed/swapped. */
void scanDatabaseForDeletedKeys(redisDb *emptied, redisDb *replaced_with) {
/* Since we are unblocking XREADGROUP clients in the event the key was
* deleted/overwritten we must do the same in case the database was
* flushed/swapped. If 'slots' is not NULL, only keys in the specified slot
* range are considered. */
void scanDatabaseForDeletedKeys(redisDb *emptied, redisDb *replaced_with, slotRangeArray *slots) {
dictEntry *de;
dictIterator di;
dictInitSafeIterator(&di, emptied->blocking_keys);
while((de = dictNext(&di)) != NULL) {
robj *key = dictGetKey(de);
/* Check if key belongs to the slot range. */
if (slots && !slotRangeArrayContains(slots, keyHashSlot(key->ptr, sdslen(key->ptr))))
continue;
int existed = 0, exists = 0;
int original_type = -1, curr_type = -1;
@ -2272,12 +2313,12 @@ int dbSwapDatabases(int id1, int id2) {
/* Swapdb should make transaction fail if there is any
* client watching keys */
touchAllWatchedKeysInDb(db1, db2);
touchAllWatchedKeysInDb(db2, db1);
touchAllWatchedKeysInDb(db1, db2, NULL);
touchAllWatchedKeysInDb(db2, db1, NULL);
/* Try to unblock any XREADGROUP clients if the key no longer exists. */
scanDatabaseForDeletedKeys(db1, db2);
scanDatabaseForDeletedKeys(db2, db1);
scanDatabaseForDeletedKeys(db1, db2, NULL);
scanDatabaseForDeletedKeys(db2, db1, NULL);
/* Swap hash tables. Note that we don't swap blocking_keys,
* ready_keys and watched_keys, since we want clients to
@ -2318,10 +2359,10 @@ void swapMainDbWithTempDb(redisDb *tempDb) {
/* Swapping databases should make transaction fail if there is any
* client watching keys. */
touchAllWatchedKeysInDb(activedb, newdb);
touchAllWatchedKeysInDb(activedb, newdb, NULL);
/* Try to unblock any XREADGROUP clients if the key no longer exists. */
scanDatabaseForDeletedKeys(activedb, newdb);
scanDatabaseForDeletedKeys(activedb, newdb, NULL);
/* Swap hash tables. Note that we don't swap blocking_keys,
* ready_keys and watched_keys, since clients
@ -2639,6 +2680,12 @@ int confAllowsExpireDel(void) {
*/
keyStatus expireIfNeeded(redisDb *db, robj *key, kvobj *kv, int flags) {
debugAssert(key != NULL || kv != NULL);
/* NOTE: Keys in slots scheduled for trimming can still exist for a while.
* If a module touches one of these keys, we remove it right away and
* return KEY_DELETED. */
if (asmActiveTrimDelIfNeeded(db, key, kv)) return KEY_DELETED;
if ((flags & EXPIRE_ALLOW_ACCESS_EXPIRED) ||
(!keyIsExpired(db, key ? key->ptr : NULL, kv)))
return KEY_VALID;
@ -2650,15 +2697,20 @@ keyStatus expireIfNeeded(redisDb *db, robj *key, kvobj *kv, int flags) {
* exception is when write operations are performed on writable
* replicas.
*
* In cluster mode, we also return ASAP if we are importing data
* from the source, to avoid deleting keys that are still in use.
* We create a fake master client for data import, which can be
* identified using the CLIENT_MASTER flag.
*
* Still we try to return the right information to the caller,
* that is, KEY_VALID if we think the key should still be valid,
* KEY_EXPIRED if we think the key is expired but don't want to delete it at this time.
*
* When replicating commands from the master, keys are never considered
* expired. */
if (server.masterhost != NULL) {
if (server.masterhost != NULL || server.cluster_enabled) {
if (server.current_client && (server.current_client->flags & CLIENT_MASTER)) return KEY_VALID;
if (!(flags & EXPIRE_FORCE_DELETE_EXPIRED)) return KEY_EXPIRED;
if (server.masterhost != NULL && !(flags & EXPIRE_FORCE_DELETE_EXPIRED)) return KEY_EXPIRED;
}
/* Check if user configuration disables lazy-expire deletions in current state.
@ -2763,11 +2815,34 @@ kvobj *dbFindExpires(redisDb *db, sds key) {
}
unsigned long long dbSize(redisDb *db) {
return kvstoreSize(db->keys);
unsigned long long total = kvstoreSize(db->keys);
if (server.cluster_enabled) {
/* If we are the master and there is no import or trim in progress,
* then we can return the total count. If not, we need to subtract
* the number of keys in slots that are not accessible, as below. */
if (clusterNodeIsMaster(getMyClusterNode()) &&
!asmImportInProgress() &&
!asmIsTrimInProgress())
{
return total;
}
/* Otherwise, and on replicas (where the slot migration state is unknown),
* we need to check each slot to see if it's accessible. */
for (int i = 0; i < CLUSTER_SLOTS; i++) {
dict *d = kvstoreGetDict(db->keys, i);
if (d && !clusterCanAccessKeysInSlot(i)) {
total -= kvstoreDictSize(db->keys, i);
}
}
}
return total;
}
unsigned long long dbScan(redisDb *db, unsigned long long cursor, dictScanFunction *scan_cb, void *privdata) {
return kvstoreScan(db->keys, cursor, -1, scan_cb, NULL, privdata);
return kvstoreScan(db->keys, cursor, -1, scan_cb, scanShouldSkipDict, privdata);
}
/* -----------------------------------------------------------------------------

View file

@ -23,6 +23,7 @@
#include "cluster.h"
#include "threads_mngr.h"
#include "script.h"
#include "cluster_asm.h"
#include <arpa/inet.h>
#include <signal.h>
@ -503,6 +504,12 @@ void debugCommand(client *c) {
" Output SHA and content of all scripts or of a specific script with its SHA.",
"MARK-INTERNAL-CLIENT [UNMARK]",
" Promote the current connection to an internal connection.",
"ASM-FAILPOINT <channel> <state>",
" Set a fail point for the specified channel and state for cluster atomic slot migration.",
"ASM-TRIM-METHOD <default|none|active|bg> <active-trim-delay> ",
" Disable trimming or force active/background trimming for cluster atomic slot migration.",
" Active trim delay is used only when method is 'active'. If it is negative,",
" active trim is disabled.",
NULL
};
addExtendedReplyHelp(c, help, clusterDebugCommandExtendedHelp());
@ -1108,6 +1115,19 @@ NULL
addReplySubcommandSyntaxError(c);
return;
}
} else if(!strcasecmp(c->argv[1]->ptr,"asm-failpoint") && c->argc == 4) {
if (asmDebugSetFailPoint(c->argv[2]->ptr, c->argv[3]->ptr) != C_OK) {
addReplyError(c, "Failed to set ASM fail point");
} else {
addReply(c, shared.ok);
}
} else if(!strcasecmp(c->argv[1]->ptr,"asm-trim-method") && c->argc >= 3) {
int delay = c->argc == 4 ? atoi(c->argv[3]->ptr) : 0;
if (asmDebugSetTrimMethod(c->argv[2]->ptr, delay) != C_OK) {
addReplyError(c, "Failed to set ASM trim method");
} else {
addReply(c, shared.ok);
}
} else if(!handleDebugClusterCommand(c)) {
addReplySubcommandSyntaxError(c);
return;
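
For reference, based on the help text and argument checks above, these debug hooks are driven roughly like this (the fail-point channel and state tokens are placeholders whose accepted values are defined elsewhere in the patch):

```
DEBUG ASM-FAILPOINT <channel> <state>
DEBUG ASM-TRIM-METHOD none
DEBUG ASM-TRIM-METHOD active 100    # force active trim; a negative delay disables active trim
DEBUG ASM-TRIM-METHOD default
```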

View file

@ -191,6 +191,26 @@ uint64_t estoreSize(estore *es) {
return es->count;
}
/* Move ebuckets from one estore to another */
void estoreMoveEbuckets(estore *src, estore *dst, int eidx) {
serverAssert(src->num_buckets > eidx);
serverAssert(src->num_buckets == dst->num_buckets);
serverAssert(ebIsEmpty(dst->ebArray[eidx])); /* If it is NULL */
/* Adjust source estore */
ebuckets eb = src->ebArray[eidx];
if (ebIsEmpty(eb)) return;
int64_t count = (int64_t)ebGetTotalItems(eb, src->bucket_type);
src->count -= count;
fwTreeUpdate(src->buckets_sizes, eidx, -count);
src->ebArray[eidx] = ebCreate(); /* Set to NULL actually.*/
/* Move ebuckets to destination estore */
dst->ebArray[eidx] = eb;
dst->count += count;
fwTreeUpdate(dst->buckets_sizes, eidx, count);
}
#ifdef REDIS_TEST
#include <stdio.h>
#include "testhelp.h"

View file

@ -77,6 +77,8 @@ int estoreGetFirstNonEmptyBucket(estore *es);
int estoreGetNextNonEmptyBucket(estore *es, int eidx);
void estoreMoveEbuckets(estore *src, estore *dst, int eidx);
/* Hash-specific function to get ExpireMeta from a hash kvobj.
* Once we shall have another data-type with subexpiry, we should refactor
* ExpireMeta to optionally reside as part of kvobj struct */

View file

@ -14,6 +14,8 @@
#include "bio.h"
#include "atomicvar.h"
#include "script.h"
#include "cluster.h"
#include "cluster_asm.h"
#include <math.h>
/* ----------------------------------------------------------------------------
@ -80,6 +82,12 @@ unsigned long long estimateObjectIdleTime(robj *o) {
}
}
/* During atomic slot migration, keys that are being imported are in an
* intermediate state. We cannot evict them and therefore skip them. */
static int randomEvictionShouldSkipDictIndex(int didx) {
return !clusterCanAccessKeysInSlot(didx);
}
/* LRU approximation algorithm
*
* Redis uses an approximation of the LRU algorithm that runs in constant
@ -127,7 +135,9 @@ int evictionPoolPopulate(redisDb *db, kvstore *samplekvs, struct evictionPoolEnt
int j, k, count;
dictEntry *samples[server.maxmemory_samples];
int slot = kvstoreGetFairRandomDictIndex(samplekvs);
/* Don't retry, since we will call evictionPoolPopulate multiple times if needed. */
int slot = kvstoreGetFairRandomDictIndex(samplekvs, randomEvictionShouldSkipDictIndex, 1, 0);
if (slot == -1) return 0;
count = kvstoreDictGetSomeKeys(samplekvs,slot,samples,server.maxmemory_samples);
for (j = 0; j < count; j++) {
unsigned long long idle;
@ -336,6 +346,11 @@ size_t freeMemoryGetNotCountedMemory(void) {
}
}
/* The migrate client is like a replica: we also push DELs into it when
* evicting keys that belong to the migrating slot, so we don't count its
* output buffer, to avoid an eviction loop. */
overhead += asmGetMigrateOutputBufferSize();
if (server.aof_state != AOF_OFF) {
overhead += sdsAllocSize(server.aof_buf);
}
@ -459,6 +474,13 @@ static int isSafeToPerformEvictions(void) {
* and just be masters exact copies. */
if (server.masterhost && server.repl_slave_ignore_maxmemory) return 0;
/* Disable eviction during slot migration import to avoid delays and errors
* caused by failed evictions. A special client is created for data import,
* identified by the CLIENT_MASTER and CLIENT_ASM_IMPORTING flags. */
if (server.current_client && server.current_client->flags & CLIENT_MASTER &&
server.current_client->flags & CLIENT_ASM_IMPORTING)
return 0;
/* If 'evict' action is paused, for whatever reason, then return false */
if (isPausedActionsWithUpdate(PAUSE_ACTION_EVICT)) return 0;
@ -556,6 +578,7 @@ int performEvictions(void) {
struct evictionPoolEntry *pool = EvictionPoolLRU;
while (bestkey == NULL) {
unsigned long total_keys = 0;
unsigned long total_sampled_keys = 0;
/* We don't want to make local-db choices when expiring keys,
* so to start populate the eviction pool sampling keys from
@ -577,6 +600,7 @@ int performEvictions(void) {
/* Do not exceed the number of non-empty slots when looping. */
while (l--) {
sampled_keys += evictionPoolPopulate(db, kvs, pool);
total_sampled_keys += sampled_keys;
/* We have sampled enough keys in the current db, exit the loop. */
if (sampled_keys >= (unsigned long) server.maxmemory_samples)
break;
@ -589,6 +613,10 @@ int performEvictions(void) {
}
if (!total_keys) break; /* No keys to evict. */
/* If we iterated all the DBs and all non-empty slot dicts but still
* did not sample any key, stop sampling. */
if (!total_sampled_keys) break;
/* Go backward from best to worst element to evict. */
for (k = EVPOOL_SIZE-1; k >= 0; k--) {
if (pool[k].key == NULL) continue;
@ -636,7 +664,8 @@ int performEvictions(void) {
} else {
kvs = db->expires;
}
int slot = kvstoreGetFairRandomDictIndex(kvs);
int slot = kvstoreGetFairRandomDictIndex(kvs, randomEvictionShouldSkipDictIndex, 16, 0);
if (slot == -1) continue;
de = kvstoreDictGetRandomKey(kvs, slot);
if (de) {
kvobj *kv = dictGetKV(de);

View file

@ -11,6 +11,7 @@
*/
#include "server.h"
#include "cluster.h"
#include "redisassert.h"
/*-----------------------------------------------------------------------------
@ -125,16 +126,21 @@ void expireScanCallback(void *privdata, const dictEntry *de, dictEntryLink plink
data->sampled++;
}
static inline int isExpiryDictValidForSamplingCb(dict *d) {
static inline int expirySamplingShouldSkipDict(dict *d, int didx) {
long long numkeys = dictSize(d);
unsigned long buckets = dictBuckets(d);
/* When there are less than 1% filled buckets, sampling the key
* space is expensive, so stop here waiting for better times...
* The dictionary will be resized asap. */
if (buckets > DICT_HT_INITIAL_SIZE && (numkeys * 100/buckets < 1)) {
return C_ERR;
return 1;
}
return C_OK;
/* During atomic slot migration, keys that are being imported are in an
* intermediate state. We cannot expire them and therefore skip them. */
if (!clusterCanAccessKeysInSlot(didx)) return 1;
return 0;
}
/* SubexpireCtx passed to activeSubexpiresCb() */
@ -243,6 +249,16 @@ static inline void activeSubexpiresCycle(int type) {
if (currentSlot == -1)
currentSlot = estoreGetFirstNonEmptyBucket(db->subexpires);
/* During atomic slot migration, keys that are being imported are in an
* intermediate state. We cannot expire them and therefore skip them. */
if (!clusterCanAccessKeysInSlot(currentSlot)) {
/* Move to next non-empty subexpires slot */
currentSlot = estoreGetNextNonEmptyBucket(db->subexpires, currentSlot);
if (currentSlot == -1)
currentDb = (currentDb + 1) % server.dbnum; /* Move to next db */
return;
}
/* Maximum number of fields to actively expire on a single call */
uint32_t maxToExpire = HFE_DB_BASE_ACTIVE_EXPIRE_FIELDS_PER_SEC / server.hz;
@ -412,7 +428,7 @@ void activeExpireCycle(int type) {
int origin_ttl_samples = data.ttl_samples;
while (data.sampled < num && checked_buckets < max_buckets) {
db->expires_cursor = kvstoreScan(db->expires, db->expires_cursor, -1, expireScanCallback, isExpiryDictValidForSamplingCb, &data);
db->expires_cursor = kvstoreScan(db->expires, db->expires_cursor, -1, expireScanCallback, expirySamplingShouldSkipDict, &data);
if (db->expires_cursor == 0) {
db_done = 1;
break;
@ -640,6 +656,7 @@ int checkAlreadyExpired(long long when) {
*
* Instead we add the already expired key to the database with expire time
* (possibly in the past) and wait for an explicit DEL from the master. */
if (server.current_client && server.current_client->flags & CLIENT_MASTER) return 0;
return (when <= commandTimeSnapshot() && !server.loading && !server.masterhost);
}

View file

@ -670,8 +670,6 @@ void fcallroCommand(client *c) {
}
/*
* FUNCTION DUMP
*
* Returns a binary payload representing all the libraries.
* Can be loaded using FUNCTION RESTORE
*
@ -686,24 +684,32 @@ void fcallroCommand(client *c) {
* The RDB version is saved for backward compatibility.
* crc64 is saved so we can verify the payload content.
*/
void functionDumpCommand(client *c) {
unsigned char buf[2];
void createFunctionDumpPayload(rio *payload) {
uint64_t crc;
rio payload;
rioInitWithBuffer(&payload, sdsempty());
unsigned char buf[2];
rdbSaveFunctions(&payload);
rioInitWithBuffer(payload, sdsempty());
rdbSaveFunctions(payload);
/* RDB version */
buf[0] = RDB_VERSION & 0xff;
buf[1] = (RDB_VERSION >> 8) & 0xff;
payload.io.buffer.ptr = sdscatlen(payload.io.buffer.ptr, buf, 2);
payload->io.buffer.ptr = sdscatlen(payload->io.buffer.ptr, buf, 2);
/* CRC64 */
crc = crc64(0, (unsigned char*) payload.io.buffer.ptr,
sdslen(payload.io.buffer.ptr));
crc = crc64(0, (unsigned char*) payload->io.buffer.ptr,
sdslen(payload->io.buffer.ptr));
memrev64ifbe(&crc);
payload.io.buffer.ptr = sdscatlen(payload.io.buffer.ptr, &crc, 8);
payload->io.buffer.ptr = sdscatlen(payload->io.buffer.ptr, &crc, 8);
}
/*
* FUNCTION DUMP
*/
void functionDumpCommand(client *c) {
rio payload;
createFunctionDumpPayload(&payload);
addReplyBulkSds(c, payload.io.buffer.ptr);
}

View file

@ -122,4 +122,6 @@ int luaEngineInitEngine(void);
int functionsInit(void);
void functionsFree(functionsLibCtx *lib_ctx, dict *engs);
void createFunctionDumpPayload(rio *payload);
#endif /* __FUNCTIONS_H_ */

View file

@ -82,8 +82,8 @@ void unbindClientFromIOThreadEventLoop(client *c) {
* we should unbind connection of client from io thread event loop first,
* and then bind the client connection into server's event loop. */
void keepClientInMainThread(client *c) {
serverAssert(c->tid != IOTHREAD_MAIN_THREAD_ID &&
c->running_tid == IOTHREAD_MAIN_THREAD_ID);
if (c->tid == IOTHREAD_MAIN_THREAD_ID) return;
serverAssert(c->running_tid == IOTHREAD_MAIN_THREAD_ID);
/* IO thread no longer manage it. */
server.io_threads_clients_num[c->tid]--;
/* Unbind connection of client from io thread event loop. */
@ -146,7 +146,8 @@ int isClientMustHandledByMainThread(client *c) {
if (c->flags & (CLIENT_CLOSE_ASAP | CLIENT_MASTER | CLIENT_SLAVE |
CLIENT_PUBSUB | CLIENT_MONITOR | CLIENT_BLOCKED |
CLIENT_UNBLOCKED | CLIENT_TRACKING | CLIENT_LUA_DEBUG |
CLIENT_LUA_DEBUG_SYNC))
CLIENT_LUA_DEBUG_SYNC | CLIENT_ASM_MIGRATING |
CLIENT_ASM_IMPORTING))
{
return 1;
}

View file

@ -385,7 +385,7 @@ unsigned long long kvstoreScan(kvstore *kvs, unsigned long long cursor,
dict *d = kvstoreGetDict(kvs, didx);
int skip = !d || (skip_cb && skip_cb(d));
int skip = !d || (skip_cb && skip_cb(d, didx));
if (!skip) {
_cursor = dictScan(d, cursor, scan_cb, privdata);
/* In dictScan, scan_cb may delete entries (e.g., in active expire case). */
@ -427,12 +427,56 @@ int kvstoreExpand(kvstore *kvs, uint64_t newsize, int try_expand, kvstoreExpandS
return 1;
}
/* Returns fair random dict index, probability of each dict being returned is proportional to the number of elements that dictionary holds.
* This function guarantees that it returns a dict-index of a non-empty dict, unless the entire kvstore is empty.
* Time complexity of this function is O(log(kvs->num_dicts)). */
int kvstoreGetFairRandomDictIndex(kvstore *kvs) {
unsigned long target = kvstoreSize(kvs) ? (randomULong() % kvstoreSize(kvs)) + 1 : 0;
return kvstoreFindDictIndexByKeyIndex(kvs, target);
/* Returns fair random dict index, probability of each dict being returned is
* proportional to the number of elements that dictionary holds.
* This function guarantees that it returns a dict-index of a non-empty dict,
* unless the entire kvstore is empty or all dicts are skipped.
*
* Parameters:
* - kvs: the kvstore instance
* - skip_cb: callback to determine if a dict should be skipped (NULL means no skipping)
* - fair_attempts: number of fair selection attempts before falling back
* - slow_fallback: if 1, uses systematic search when fair attempts fail
*
* Returns:
* - Valid dict index (>= 0) on success
* - -1 if no valid dict found (either slow_fallback is 0 or all dicts are skipped)
*
* Time complexity: O(fair_attempts * log(kvs->num_dicts)) for fair attempts,
* plus O(kvs->num_dicts) for systematic fallback if enabled.
*/
int kvstoreGetFairRandomDictIndex(kvstore *kvs, kvstoreRandomShouldSkipDictIndex *skip_cb,
int fair_attempts, int slow_fallback)
{
if (kvs->num_dicts == 1 || kvstoreSize(kvs) == 0)
return 0;
unsigned long long total_size = kvstoreSize(kvs);
/* Try fair attempts first. If skip_cb is not applicable, execute only once. */
for (int attempt = 0; attempt < fair_attempts; attempt++) {
unsigned long target = (randomULong() % total_size) + 1;
int didx = kvstoreFindDictIndexByKeyIndex(kvs, target);
if (!skip_cb || !skip_cb(didx)) {
return didx;
}
}
/* If fair attempts failed and slow fallback is allowed */
if (slow_fallback) {
/* systematic check from random start */
int start = randomULong() % kvs->num_dicts;
for (int i = 0; i < kvs->num_dicts; i++) {
int didx = (start + i) % kvs->num_dicts;
dict *d = kvstoreGetDict(kvs, didx);
if (d && (!skip_cb || !skip_cb(didx))) {
return didx;
}
}
}
/* Failed to find valid dict that has elements */
return -1;
}
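
A minimal caller sketch under the new signature, mirroring how this patch uses it from db.c and evict.c (the skip callback and the cluster helper below are taken from those call sites; the wrapper function is mine):

```c
/* Skip dict indexes (slots) whose keys are not currently accessible. */
static int skipInaccessibleSlot(int didx) {
    return !clusterCanAccessKeysInSlot(didx);
}

static void pickRandomKeyExample(kvstore *kvs) {
    /* Up to 16 fair attempts, then fall back to a systematic scan. */
    int didx = kvstoreGetFairRandomDictIndex(kvs, skipInaccessibleSlot, 16, 1);
    if (didx == -1) return; /* kvstore empty or every dict skipped */
    dictEntry *de = kvstoreDictGetFairRandomKey(kvs, didx);
    (void) de;
}
```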
void kvstoreGetStats(kvstore *kvs, char *buf, size_t bufsize, int full) {
@ -535,6 +579,37 @@ int kvstoreNumDicts(kvstore *kvs) {
return kvs->num_dicts;
}
/* Move dict from one kvstore to another. */
void kvstoreMoveDict(kvstore *kvs, kvstore *dst, int didx) {
serverAssert(kvs->num_dicts > didx);
serverAssert(kvs->num_dicts == dst->num_dicts);
serverAssert(dst->dicts[didx] == NULL);
dict *d = kvs->dicts[didx];
if (d == NULL) return;
/* Adjust source kvstore */
kvs->allocated_dicts -= 1;
cumulativeKeyCountAdd(kvs, didx, -((long long)dictSize(d)));
kvstoreDictBucketChanged(d, -((long long) dictBuckets(d)));
/* If rehashing, stop it. */
if (dictIsRehashing(d))
kvstoreDictRehashingCompleted(d);
/* Clear dict from source kvstore and create a new one if needed */
kvs->dicts[didx] = NULL;
if (!(kvs->flags & (KVSTORE_ALLOCATE_DICTS_ON_DEMAND | KVSTORE_FREE_EMPTY_DICTS)))
createDictIfNeeded(kvs, didx);
/* Move dict to destination kvstore */
dst->dicts[didx] = d;
dst->dicts[didx]->type = &dst->dtype;
dst->allocated_dicts += 1;
cumulativeKeyCountAdd(dst, didx, dictSize(d));
kvstoreDictBucketChanged(d, dictBuckets(d));
if (dictIsRehashing(dst->dicts[didx]))
kvstoreDictRehashingStarted(dst->dicts[didx]);
}
/* Returns kvstore iterator that can be used to iterate through sub-dictionaries.
*
* The caller should free the resulting kvs_it with kvstoreIteratorRelease. */
@ -733,7 +808,7 @@ unsigned int kvstoreDictGetSomeKeys(kvstore *kvs, int didx, dictEntry **des, uns
int kvstoreDictExpand(kvstore *kvs, int didx, unsigned long size)
{
dict *d = kvstoreGetDict(kvs, didx);
dict *d = createDictIfNeeded(kvs, didx);
if (!d)
return DICT_ERR;
return dictExpand(d, size);

View file

@ -48,8 +48,9 @@ typedef struct _kvstore kvstore;
typedef struct _kvstoreIterator kvstoreIterator;
typedef struct _kvstoreDictIterator kvstoreDictIterator;
typedef int (kvstoreScanShouldSkipDict)(dict *d);
typedef int (kvstoreScanShouldSkipDict)(dict *d, int didx);
typedef int (kvstoreExpandShouldSkipDictIndex)(int didx);
typedef int (kvstoreRandomShouldSkipDictIndex)(int didx);
#define KVSTORE_ALLOCATE_DICTS_ON_DEMAND (1<<0)
#define KVSTORE_FREE_EMPTY_DICTS (1<<1)
@ -65,7 +66,8 @@ unsigned long long kvstoreScan(kvstore *kvs, unsigned long long cursor,
kvstoreScanShouldSkipDict *skip_cb,
void *privdata);
int kvstoreExpand(kvstore *kvs, uint64_t newsize, int try_expand, kvstoreExpandShouldSkipDictIndex *skip_cb);
int kvstoreGetFairRandomDictIndex(kvstore *kvs);
int kvstoreGetFairRandomDictIndex(kvstore *kvs, kvstoreRandomShouldSkipDictIndex *skip_cb,
int fair_attempts, int slow_fallback);
void kvstoreGetStats(kvstore *kvs, char *buf, size_t bufsize, int full);
int kvstoreFindDictIndexByKeyIndex(kvstore *kvs, unsigned long target);
@ -74,6 +76,7 @@ int kvstoreGetNextNonEmptyDictIndex(kvstore *kvs, int didx);
int kvstoreNumNonEmptyDicts(kvstore *kvs);
int kvstoreNumAllocatedDicts(kvstore *kvs);
int kvstoreNumDicts(kvstore *kvs);
void kvstoreMoveDict(kvstore *kvs, kvstore *dst, int didx);
/* kvstore iterator specific functions */
kvstoreIterator *kvstoreIteratorInit(kvstore *kvs);

View file

@ -210,8 +210,13 @@ void emptyDbAsync(redisDb *db) {
db->keys = kvstoreCreate(&dbDictType, slot_count_bits, flags | KVSTORE_ALLOC_META_KEYS_HIST);
db->expires = kvstoreCreate(&dbExpiresDictType, slot_count_bits, flags);
db->subexpires = estoreCreate(&subexpiresBucketsType, slot_count_bits);
atomicIncr(lazyfree_objects, kvstoreSize(oldkeys));
bioCreateLazyFreeJob(lazyfreeFreeDatabase, 3, oldkeys, oldexpires, oldsubexpires);
emptyDbDataAsync(oldkeys, oldexpires, oldsubexpires);
}
/* Empty a Redis DB data asynchronously. */
void emptyDbDataAsync(kvstore *keys, kvstore *expires, ebuckets hexpires) {
atomicIncr(lazyfree_objects, kvstoreSize(keys));
bioCreateLazyFreeJob(lazyfreeFreeDatabase, 3, keys, expires, hexpires);
}
/* Free the key tracking table.

View file

@ -38,6 +38,7 @@
#include "server.h"
#include "cluster.h"
#include "cluster_asm.h"
#include "slowlog.h"
#include "rdb.h"
#include "monotonic.h"
@ -93,6 +94,7 @@ struct AutoMemEntry {
#define REDISMODULE_AM_DICT 4
#define REDISMODULE_AM_INFO 5
#define REDISMODULE_AM_CONFIG 6
#define REDISMODULE_AM_SLOTRANGEARRAY 7
/* The pool allocator block. Redis Modules can allocate memory via this special
* allocator that will automatically release it all once the callback returns.
@ -497,6 +499,7 @@ static void moduleInitKeyTypeSpecific(RedisModuleKey *key);
void RM_FreeDict(RedisModuleCtx *ctx, RedisModuleDict *d);
void RM_FreeServerInfo(RedisModuleCtx *ctx, RedisModuleServerInfoData *data);
void RM_ConfigIteratorRelease(RedisModuleCtx *ctx, RedisModuleConfigIterator *iter);
void RM_ClusterFreeSlotRanges(RedisModuleCtx *ctx, RedisModuleSlotRangeArray *slots);
/* Helpers for RM_SetCommandInfo. */
static int moduleValidateCommandInfo(const RedisModuleCommandInfo *info);
@ -2621,6 +2624,7 @@ void autoMemoryCollect(RedisModuleCtx *ctx) {
case REDISMODULE_AM_DICT: RM_FreeDict(NULL,ptr); break;
case REDISMODULE_AM_INFO: RM_FreeServerInfo(NULL,ptr); break;
case REDISMODULE_AM_CONFIG: RM_ConfigIteratorRelease(NULL, ptr); break;
case REDISMODULE_AM_SLOTRANGEARRAY: RM_ClusterFreeSlotRanges(NULL, ptr); break;
}
}
ctx->flags |= REDISMODULE_CTX_AUTO_MEMORY;
@ -8817,9 +8821,11 @@ void moduleReleaseGIL(void) {
* - REDISMODULE_NOTIFY_NEW: New key notification
* - REDISMODULE_NOTIFY_OVERWRITTEN: Overwritten events
* - REDISMODULE_NOTIFY_TYPE_CHANGED: Type-changed events
* - REDISMODULE_NOTIFY_KEY_TRIMMED: Key trimmed events after a slot migration operation
* - REDISMODULE_NOTIFY_ALL: All events (Excluding REDISMODULE_NOTIFY_KEYMISS,
* REDISMODULE_NOTIFY_NEW, REDISMODULE_NOTIFY_OVERWRITTEN
* and REDISMODULE_NOTIFY_TYPE_CHANGED)
* REDISMODULE_NOTIFY_NEW, REDISMODULE_NOTIFY_OVERWRITTEN,
* REDISMODULE_NOTIFY_TYPE_CHANGED
* and REDISMODULE_NOTIFY_KEY_TRIMMED)
* - REDISMODULE_NOTIFY_LOADED: A special notification available only for modules,
* indicates that the key was loaded from persistence.
* Notice, when this event fires, the given key
@ -8901,6 +8907,18 @@ int RM_UnsubscribeFromKeyspaceEvents(RedisModuleCtx *ctx, int types, RedisModule
return removed > 0 ? REDISMODULE_OK : REDISMODULE_ERR;
}
/* Check any subscriber for event */
int moduleHasSubscribersForKeyspaceEvent(int type) {
listIter li;
listNode *ln;
listRewind(moduleKeyspaceSubscribers,&li);
while((ln = listNext(&li))) {
RedisModuleKeyspaceSubscriber *sub = ln->value;
if (sub->event_mask & type) return 1;
}
return 0;
}
void firePostExecutionUnitJobs(void) {
/* Avoid propagation of commands.
* In that way, postExecutionUnitOperations will prevent
@ -9021,8 +9039,10 @@ void moduleNotifyKeyspaceEvent(int type, const char *event, robj *key, int dbid)
int prev_active = sub->active;
sub->active = 1;
server.allow_access_expired++;
server.allow_access_trimmed++;
sub->notify_callback(&ctx, type, event, key);
server.allow_access_expired--;
server.allow_access_trimmed--;
sub->active = prev_active;
moduleFreeContext(&ctx);
}
@ -9293,6 +9313,101 @@ const char *RM_ClusterCanonicalKeyNameInSlot(unsigned int slot) {
return (slot < CLUSTER_SLOTS) ? crc16_slot_table[slot] : NULL;
}
/* Returns 1 if keys in the specified slot can be accessed by this node, 0 otherwise.
*
* This function returns 1 in the following cases:
* - The slot is owned by this node or by its master if this node is a replica
* - The slot is being imported under the old slot migration approach (CLUSTER SETSLOT <slot> IMPORTING ..)
* - Not in cluster mode (all slots are accessible)
*
* Returns 0 for:
* - Invalid slot numbers (< 0 or >= 16384)
* - Slots owned by other nodes
*/
int RM_ClusterCanAccessKeysInSlot(int slot) {
if (slot < 0 || slot >= CLUSTER_SLOTS) return 0;
return clusterCanAccessKeysInSlot(slot);
}
/* Propagate commands along with slot migration.
*
* This function allows modules to add commands that will be sent to the
* destination node before the actual slot migration begins. It should only be
* called during the REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_MODULE_PROPAGATE event.
*
* This function can be called multiple times within the same event to
* replicate multiple commands. All commands will be sent before the
* actual slot data migration begins.
*
* Note: This function is only available in the fork child process just before
* slot snapshot delivery begins.
*
* On success REDISMODULE_OK is returned, otherwise
* REDISMODULE_ERR is returned and errno is set to the following values:
*
* * EINVAL: function arguments or format specifiers are invalid.
* * EBADF: not called in the correct context, e.g. not called in the REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_MODULE_PROPAGATE event.
* * ENOENT: command does not exist.
* * ENOTSUP: command is cross-slot.
* * ERANGE: command contains keys that are not within the migrating slot range.
*/
int RM_ClusterPropagateForSlotMigration(RedisModuleCtx *ctx, const char *cmdname, const char *fmt, ...) {
int argc = 0, flags = 0;
robj **argv = NULL;
struct redisCommand *cmd;
va_list ap;
if (ctx == NULL || cmdname == NULL || fmt == NULL) {
errno = EINVAL;
return REDISMODULE_ERR;
}
errno = 0;
cmd = lookupCommandByCString((char*)cmdname);
if (!cmd) {
errno = ENOENT;
return REDISMODULE_ERR;
}
va_start(ap, fmt);
argv = moduleCreateArgvFromUserFormat(cmdname, fmt, &argc, &flags, ap);
va_end(ap);
if (argv == NULL) {
errno = EINVAL;
return REDISMODULE_ERR;
}
int ret = asmModulePropagateBeforeSlotSnapshot(cmd, argv, argc);
int saved_errno = errno;
/* Release the argv. */
for (int i = 0; i < argc; i++) decrRefCount(argv[i]);
zfree(argv);
errno = saved_errno;
return ret == C_OK ? REDISMODULE_OK : REDISMODULE_ERR;
}
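
For orientation, a minimal, hypothetical module-side sketch of the call pattern described above (the module wiring, key and value are mine; the event, subevent and info struct are the ones added to redismodule.h by this patch):

```c
/* Hypothetical sketch: enqueue one extra command ahead of the slot snapshot. */
static void onSlotMigration(RedisModuleCtx *ctx, RedisModuleEvent e,
                            uint64_t sub, void *data)
{
    REDISMODULE_NOT_USED(e);
    if (sub != REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_MODULE_PROPAGATE)
        return;
    RedisModuleClusterSlotMigrationInfo *info = data;
    REDISMODULE_NOT_USED(info); /* task_id / slots are available here if needed */

    /* The key must hash into the migrating slot range, otherwise ERANGE. */
    if (RedisModule_ClusterPropagateForSlotMigration(ctx, "SET", "cc",
                                                     "{user:1000}:meta", "v1")
        == REDISMODULE_ERR)
    {
        /* errno is one of EINVAL, EBADF, ENOENT, ENOTSUP, ERANGE. */
    }
}

/* In RedisModule_OnLoad():
 *   RedisModule_SubscribeToServerEvent(ctx, RedisModuleEvent_ClusterSlotMigration,
 *                                      onSlotMigration);
 */
```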
/* Returns the locally owned slot ranges for the node.
*
* An optional `ctx` can be provided to enable auto-memory management.
* If cluster mode is disabled, the array will include all slots (0-16383).
* If the node is a replica, the slot ranges of its master are returned.
*
* The returned array must be freed with RM_ClusterFreeSlotRanges().
*/
RedisModuleSlotRangeArray *RM_ClusterGetLocalSlotRanges(RedisModuleCtx *ctx) {
slotRangeArray *slots = clusterGetLocalSlotRanges();
if (ctx) autoMemoryAdd(ctx, REDISMODULE_AM_SLOTRANGEARRAY, slots);
return (RedisModuleSlotRangeArray *)slots;
}
/* Frees a slot range array returned by RM_ClusterGetLocalSlotRanges().
* Pass the `ctx` pointer only if the array was created with a context. */
void RM_ClusterFreeSlotRanges(RedisModuleCtx *ctx, RedisModuleSlotRangeArray *slots) {
if (ctx) autoMemoryFreed(ctx, REDISMODULE_AM_SLOTRANGEARRAY, slots);
slotRangeArrayFree((slotRangeArray *)slots);
}
/* --------------------------------------------------------------------------
* ## Modules Timers API
*
@ -11600,6 +11715,8 @@ static uint64_t moduleEventVersions[] = {
-1, /* REDISMODULE_EVENT_EVENTLOOP */
-1, /* REDISMODULE_EVENT_CONFIG */
REDISMODULE_KEYINFO_VERSION, /* REDISMODULE_EVENT_KEY */
REDISMODULE_CLUSTER_SLOT_MIGRATION_INFO_VERSION, /* REDISMODULE_EVENT_CLUSTER_SLOT_MIGRATION */
REDISMODULE_CLUSTER_SLOT_MIGRATION_TRIMINFO_VERSION, /* REDISMODULE_EVENT_CLUSTER_SLOT_MIGRATION_TRIM */
};
/* Register to be notified, via a callback, when the specified server event
@ -11890,6 +12007,63 @@ static uint64_t moduleEventVersions[] = {
*
* RedisModuleKey *key; // Key name
*
* * RedisModuleEvent_ClusterSlotMigration
*
* Called when an atomic slot migration (ASM) event happens.
* IMPORT events are triggered on the destination side of a slot migration
* operation. These notifications let modules prepare for the upcoming
* ownership change, observe successful completion once the cluster config
* reflects the new owner, or detect a failure in which case slot ownership
* remains with the source.
*
* Similarly, MIGRATE events are triggered on the source side of a slot
* migration operation to let modules prepare for the ownership change and
* observe the completion of the slot migration. MIGRATE_MODULE_PROPAGATE
* event is triggered in the fork just before snapshot delivery; modules may
* use it to enqueue commands that will be delivered first. See
* RedisModule_ClusterPropagateForSlotMigration() for details.
*
* * `REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_STARTED`
* * `REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_FAILED`
* * `REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_COMPLETED`
* * `REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_STARTED`
* * `REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_FAILED`
* * `REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_COMPLETED`
* * `REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_MODULE_PROPAGATE`
*
* The data pointer can be cast to a RedisModuleClusterSlotMigrationInfo
* structure with the following fields:
*
* char source_node_id[REDISMODULE_NODE_ID_LEN + 1];
* char destination_node_id[REDISMODULE_NODE_ID_LEN + 1];
* const char *task_id; // Task ID
* RedisModuleSlotRangeArray *slots; // Slot ranges
*
* * RedisModuleEvent_ClusterSlotMigrationTrim
*
* Called when trimming keys after a slot migration. Fires on the source
* after a successful migration to clean up migrated keys, or on the
* destination after a failed import to discard partial imports. Two methods
* are supported. In the first method, keys are deleted in a background
* thread; this is reported via the TRIM_BACKGROUND event. In the second
* method, Redis performs incremental deletions on the main thread via the
* cron loop to avoid stalls; this is reported via the TRIM_STARTED and
* TRIM_COMPLETED events. Each deletion emits REDISMODULE_NOTIFY_KEY_TRIMMED
* so modules can react to individual key deletions. Redis selects the
* method automatically: background by default, switching to main-thread
* trimming when a module subscribes to REDISMODULE_NOTIFY_KEY_TRIMMED.
*
* The following sub events are available:
*
* * `REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_STARTED`
* * `REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_COMPLETED`
* * `REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_BACKGROUND`
*
* The data pointer can be cast to a RedisModuleClusterSlotMigrationTrimInfo
* structure with the following fields:
*
* RedisModuleSlotRangeArray *slots; // Slot ranges
*
* The function returns REDISMODULE_OK if the module was successfully subscribed
* for the specified event. If the API is called from a wrong context or unsupported event
* is given then REDISMODULE_ERR is returned. */
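
As a companion sketch, a module that wants to observe individual trimmed keys could subscribe to the new notification flag; per the description above, the presence of such a subscriber is what switches trimming to the main thread. The handler below is hypothetical and the log text is illustrative only:

```c
/* Hypothetical sketch: per-key callback for keys removed by trimming. */
static int onKeyTrimmed(RedisModuleCtx *ctx, int type, const char *event,
                        RedisModuleString *key)
{
    REDISMODULE_NOT_USED(type);
    REDISMODULE_NOT_USED(event);
    RedisModule_Log(ctx, "notice", "trimmed: %s",
                    RedisModule_StringPtrLen(key, NULL));
    return REDISMODULE_OK;
}

/* In RedisModule_OnLoad():
 *   RedisModule_SubscribeToKeyspaceEvents(ctx, REDISMODULE_NOTIFY_KEY_TRIMMED,
 *                                         onKeyTrimmed);
 */
```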
@ -11971,6 +12145,10 @@ int RM_IsSubEventSupported(RedisModuleEvent event, int64_t subevent) {
return subevent < _REDISMODULE_SUBEVENT_CONFIG_NEXT;
case REDISMODULE_EVENT_KEY:
return subevent < _REDISMODULE_SUBEVENT_KEY_NEXT;
case REDISMODULE_EVENT_CLUSTER_SLOT_MIGRATION:
return subevent < _REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_NEXT;
case REDISMODULE_EVENT_CLUSTER_SLOT_MIGRATION_TRIM:
return subevent < _REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_NEXT;
default:
break;
}
@ -12058,6 +12236,10 @@ void moduleFireServerEvent(uint64_t eid, int subid, void *data) {
selectDb(ctx.client, info->dbnum);
moduleInitKey(&key, &ctx, info->key, info->kv, info->mode);
moduledata = &ki;
} else if (eid == REDISMODULE_EVENT_CLUSTER_SLOT_MIGRATION) {
moduledata = data;
} else if (eid == REDISMODULE_EVENT_CLUSTER_SLOT_MIGRATION_TRIM) {
moduledata = data;
}
el->module->in_hook++;
@ -12115,6 +12297,7 @@ void processModuleLoadingProgressEvent(int is_aof) {
* will be called to tell the module which key is about to be released. */
void moduleNotifyKeyUnlink(robj *key, kvobj *kv, int dbid, int flags) {
server.allow_access_expired++;
server.allow_access_trimmed++;
int subevent = REDISMODULE_SUBEVENT_KEY_DELETED;
if (flags & DB_FLAG_KEY_EXPIRED) {
subevent = REDISMODULE_SUBEVENT_KEY_EXPIRED;
@ -12138,6 +12321,7 @@ void moduleNotifyKeyUnlink(robj *key, kvobj *kv, int dbid, int flags) {
}
}
server.allow_access_expired--;
server.allow_access_trimmed--;
}
/* Return the free_effort of the module, it will automatically choose to call
@ -14786,6 +14970,10 @@ void moduleRegisterCoreAPI(void) {
REGISTER_API(SetClusterFlags);
REGISTER_API(ClusterKeySlot);
REGISTER_API(ClusterCanonicalKeyNameInSlot);
REGISTER_API(ClusterCanAccessKeysInSlot);
REGISTER_API(ClusterPropagateForSlotMigration);
REGISTER_API(ClusterGetLocalSlotRanges);
REGISTER_API(ClusterFreeSlotRanges);
REGISTER_API(CreateDict);
REGISTER_API(FreeDict);
REGISTER_API(DictSize);

View file

@ -8,6 +8,7 @@
*/
#include "server.h"
#include "cluster.h"
/* ================================ MULTI/EXEC ============================== */
@ -405,13 +406,15 @@ void touchWatchedKey(redisDb *db, robj *key) {
/* Set CLIENT_DIRTY_CAS to all clients of DB when DB is dirty.
* It may happen in the following situations:
* FLUSHDB, FLUSHALL, SWAPDB, end of successful diskless replication.
* - FLUSHDB, FLUSHALL, SWAPDB, end of successful diskless replication.
* - Atomic slot migration trimming phase. In this case, 'slots' is set and only
* keys in the specified slots are touched.
*
* replaced_with: for SWAPDB, the WATCH should be invalidated if
* the key exists in either of them, and skipped only if it
* doesn't exist in both. */
REDIS_NO_SANITIZE("thread")
void touchAllWatchedKeysInDb(redisDb *emptied, redisDb *replaced_with) {
void touchAllWatchedKeysInDb(redisDb *emptied, redisDb *replaced_with, struct slotRangeArray *slots) {
listIter li;
listNode *ln;
dictEntry *de;
@ -422,6 +425,8 @@ void touchAllWatchedKeysInDb(redisDb *emptied, redisDb *replaced_with) {
dictInitSafeIterator(&di, emptied->watched_keys);
while((de = dictNext(&di)) != NULL) {
robj *key = dictGetKey(de);
if (slots && !slotRangeArrayContains(slots, keyHashSlot(key->ptr, sdslen(key->ptr))))
continue;
int exists_in_emptied = dbFind(emptied, key->ptr) != NULL;
if (exists_in_emptied ||
(replaced_with && dbFind(replaced_with, key->ptr) != NULL))

View file

@ -19,6 +19,7 @@
#include "script.h"
#include "fpconv_dtoa.h"
#include "fmtargs.h"
#include "cluster_asm.h"
#include <sys/socket.h>
#include <sys/uio.h>
#include <math.h>
@ -231,6 +232,8 @@ client *createClient(connection *conn) {
c->net_input_bytes = 0;
c->net_output_bytes = 0;
c->commands_processed = 0;
c->task = NULL;
c->node_id = NULL;
return c;
}
@ -564,8 +567,13 @@ void afterErrorReply(client *c, const char *s, size_t len, int flags) {
to = "AOF-loading-client";
from = "server";
} else if (ctype == CLIENT_TYPE_MASTER) {
to = "master";
from = "replica";
if (c->flags & CLIENT_ASM_IMPORTING) {
to = "source";
from = "destination";
} else {
to = "master";
from = "replica";
}
} else {
to = "replica";
from = "master";
@ -577,7 +585,7 @@ void afterErrorReply(client *c, const char *s, size_t len, int flags) {
"to its %s: '%.*s' after processing the command "
"'%s'", from, to, (int)len, s, cmdname ? cmdname : "<unknown>");
if (ctype == CLIENT_TYPE_MASTER && server.repl_backlog &&
server.repl_backlog->histlen > 0)
!(c->flags & CLIENT_ASM_IMPORTING) && server.repl_backlog->histlen > 0)
{
showLatestBacklog();
}
@ -1755,6 +1763,8 @@ void freeClient(client *c) {
c);
}
asmCallbackOnFreeClient(c);
/* Notify module system that this client auth status changed. */
moduleNotifyUserChanged(c);
@ -1899,6 +1909,7 @@ void freeClient(client *c) {
sdsfree(c->peerid);
sdsfree(c->sockname);
sdsfree(c->slave_addr);
sdsfree(c->node_id);
zfree(c);
}
@ -2164,8 +2175,8 @@ int writeToClient(client *c, int handler_installed) {
atomicIncr(server.stat_net_repl_output_bytes, totwritten);
} else {
/* If we reach this block and client is marked with CLIENT_SLAVE flag
* it's because it's a MONITOR client, which are marked as replicas,
* but exposed as normal clients */
* it's because it's a MONITOR/slot-migration client, which is marked
* as a replica but exposed as a normal client */
const int is_normal_client = !(c->flags & CLIENT_SLAVE);
while (_clientHasPendingRepliesNonSlave(c)) {
int ret = _writeToClientNonSlave(c, &nwritten);
@ -2751,6 +2762,10 @@ void commandProcessed(client *c) {
if (applied) {
replicationFeedStreamFromMasterStream(c->querybuf+c->repl_applied,applied);
c->repl_applied += applied;
/* Update the atomic slot migration task's applied bytes. */
if (c->flags & CLIENT_ASM_IMPORTING)
asmImportIncrAppliedBytes(c->task, applied);
}
}
}
@ -2955,7 +2970,9 @@ int processInputBuffer(client *c) {
* thread handle. To avoid memory prefetching on an invalid command. */
c->iolookedcmd = NULL;
}
c->slot = getSlotFromCommand(c->iolookedcmd, c->argv, c->argc);
int slot = getSlotFromCommand(c->iolookedcmd, c->argv, c->argc);
/* Reset to -1, since c->slot expects -1 if no slot is being used */
c->slot = (slot == GETSLOT_CROSSSLOT || slot == GETSLOT_NOKEYS) ? -1 : slot;
enqueuePendingClientsToMainThread(c, 0);
break;
}
@ -3215,10 +3232,17 @@ sds catClientInfoString(sds s, client *client) {
if (client->flags & CLIENT_SLAVE) {
if (client->flags & CLIENT_MONITOR)
*p++ = 'O';
else if (client->flags & CLIENT_ASM_MIGRATING)
*p++ = 'g';
else
*p++ = 'S';
}
if (client->flags & CLIENT_MASTER) *p++ = 'M';
if (client->flags & CLIENT_MASTER) {
if (client->flags & CLIENT_ASM_IMPORTING)
*p++ = 'o';
else
*p++ = 'M';
}
if (client->flags & CLIENT_PUBSUB) *p++ = 'P';
if (client->flags & CLIENT_MULTI) *p++ = 'x';
if (client->flags & CLIENT_BLOCKED) *p++ = 'b';
@ -4267,6 +4291,14 @@ size_t getClientOutputBufferMemoryUsage(client *c) {
}
}
size_t getNormalClientPendingReplyBytes(client *c) {
serverAssert(!clientTypeIsSlave(c));
if (listLength(c->reply) == 0) return c->bufpos;
clientReplyBlock *block = listNodeValue(listLast(c->reply));
return (c->reply_bytes - block->size + block->used) + c->bufpos;
}
/* Returns the total client's memory usage.
* Optionally, if output_buffer_mem_usage is not NULL, it fills it with
* the client output buffer memory usage portion of the total. */
@ -4315,10 +4347,13 @@ int getClientType(client *c) {
}
static inline int clientTypeIsSlave(client *c) {
/* Even though MONITOR clients are marked as replicas, we
* want the expose them as normal clients. */
if (unlikely((c->flags & CLIENT_SLAVE) && !(c->flags & CLIENT_MONITOR)))
/* Even though MONITOR clients and ASM destination RDB/main channels are
* marked as replicas, we want to expose them as normal clients. */
if (unlikely((c->flags & CLIENT_SLAVE) &&
!(c->flags & (CLIENT_MONITOR | CLIENT_ASM_MIGRATING))))
{
return 1;
}
return 0;
}
@ -4529,7 +4564,10 @@ static void pauseClientsByClient(mstime_t endTime, int isPauseClientAll) {
if (p->paused_actions & PAUSE_ACTION_CLIENT_ALL)
actions = PAUSE_ACTIONS_CLIENT_ALL_SET;
}
/* Cancel all ASM tasks when starting client pause */
clusterAsmCancel(NULL, "client pause requested");
pauseActions(PAUSE_BY_CLIENT_COMMAND, endTime, actions);
}
@ -4563,6 +4601,13 @@ void pauseActions(pause_purpose purpose, mstime_t end, uint32_t actions) {
if (server.in_exec) {
server.client_pause_in_transaction = 1;
}
/* Assert that there is no import task in progress when we are pausing,
* otherwise we break the promise that no writes are performed, possibly
* causing data loss during a failover. */
if (isPausedActions(PAUSE_ACTION_CLIENT_ALL) ||
isPausedActions(PAUSE_ACTION_CLIENT_WRITE))
serverAssert(!asmImportInProgress());
}
/* Unpause actions and queue them for reprocessing. */

View file

@ -14,6 +14,7 @@
#include "server.h"
#include "functions.h"
#include "intset.h" /* Compact integer set structure */
#include "cluster_asm.h"
#include <math.h>
#include <ctype.h>
@ -1470,6 +1471,12 @@ struct redisMemOverhead *getMemoryOverheadData(void) {
mh->script_vm += functionsMemoryVM();
mem_total+=mh->script_vm;
/* Cluster atomic slot migration buffers. */
mh->asm_import_input_buffer = asmGetImportInputBufferSize();
mh->asm_migrate_output_buffer = asmGetMigrateOutputBufferSize();
mem_total += mh->asm_import_input_buffer;
mem_total += mh->asm_migrate_output_buffer;
for (j = 0; j < server.dbnum; j++) {
redisDb *db = server.db+j;
if (!kvstoreNumAllocatedDicts(db->keys)) continue;

View file

@ -21,6 +21,7 @@
#include "functions.h"
#include "intset.h" /* Compact integer set structure */
#include "bio.h"
#include "cluster_asm.h"
#include <math.h>
#include <fcntl.h>
@ -1280,6 +1281,16 @@ int rdbSaveInfoAuxFields(rio *rdb, int rdbflags, rdbSaveInfo *rsi) {
== -1) return -1;
}
if (rdbSaveAuxFieldStrInt(rdb, "aof-base", aof_base) == -1) return -1;
/* Save the active import ASM task if cluster is enabled. */
if (server.cluster_enabled) {
sds task_info = asmDumpActiveImportTask();
int ret = rdbSaveAuxFieldStrStr(rdb, "cluster-asm-task",
task_info ? task_info : "");
if (task_info) sdsfree(task_info);
if (ret == -1) return -1;
}
return 1;
}
@ -1369,7 +1380,7 @@ werr:
return -1;
}
ssize_t rdbSaveDb(rio *rdb, int dbid, int rdbflags, long *key_counter) {
ssize_t rdbSaveDb(rio *rdb, int dbid, int rdbflags, long *key_counter, unsigned long long *skipped) {
dictEntry *de;
ssize_t written = 0;
ssize_t res;
@ -1418,6 +1429,12 @@ ssize_t rdbSaveDb(rio *rdb, int dbid, int rdbflags, long *key_counter) {
long long expire;
size_t rdb_bytes_before_key = rdb->processed_bytes;
/* Skip keys that are being trimmed */
if (server.cluster_enabled && isSlotInTrimJob(curr_slot)) {
(*skipped)++;
continue;
}
initStaticStringObject(key,kvobjGetKey(kv));
expire = kvobjGetExpire(kv);
if ((res = rdbSaveKeyValuePair(rdb, &key, kv, expire, dbid)) < 0) goto werr;
@ -1460,6 +1477,7 @@ int rdbSaveRio(int req, rio *rdb, int *error, int rdbflags, rdbSaveInfo *rsi) {
char magic[10];
uint64_t cksum;
long key_counter = 0;
unsigned long long skipped = 0;
int j;
if (server.rdb_checksum)
@ -1475,7 +1493,7 @@ int rdbSaveRio(int req, rio *rdb, int *error, int rdbflags, rdbSaveInfo *rsi) {
/* save all databases, skip this if we're in functions-only mode */
if (!(req & SLAVE_REQ_RDB_EXCLUDE_DATA)) {
for (j = 0; j < server.dbnum; j++) {
if (rdbSaveDb(rdb, j, rdbflags, &key_counter) == -1) goto werr;
if (rdbSaveDb(rdb, j, rdbflags, &key_counter, &skipped) == -1) goto werr;
}
}
@ -1489,6 +1507,7 @@ int rdbSaveRio(int req, rio *rdb, int *error, int rdbflags, rdbSaveInfo *rsi) {
cksum = rdb->cksum;
memrev64ifbe(&cksum);
if (rioWrite(rdb,&cksum,8) == 0) goto werr;
serverLog(LL_NOTICE, "BGSAVE done, %ld keys saved, %llu keys skipped, %zu bytes written.", key_counter, skipped, rdb->processed_bytes);
return C_OK;
werr:
@ -3494,6 +3513,8 @@ int rdbLoadRioWithLoadingCtx(rio *rdb, int rdbflags, rdbSaveInfo *rsi, rdbLoadin
} else if (!strcasecmp(auxkey->ptr, "aof-base")) {
long long isbase = strtoll(auxval->ptr, NULL, 10);
if (isbase) serverLog(LL_NOTICE, "RDB is base AOF");
} else if (!strcasecmp(auxkey->ptr,"cluster-asm-task")) {
asmReplicaHandleMasterTask(auxval->ptr);
} else if (!strcasecmp(auxkey->ptr,"redis-bits")) {
/* Just ignored. */
} else {
@ -3723,6 +3744,8 @@ int rdbLoad(char *filename, rdbSaveInfo *rsi, int rdbflags) {
return rdbLoadWithEmptyFunc(filename, rsi, rdbflags, NULL);
}
int slotSnapshotSaveRio(int req, rio *rdb, int *error);
/* Like rdbLoadRio() but takes a filename instead of a rio stream. The
* filename is open for reading and a rio stream object created in order
* to do the actual loading. Moreover the ETA displayed in the INFO
@ -3887,6 +3910,7 @@ int rdbSaveToSlavesSockets(int req, rdbSaveInfo *rsi) {
pid_t childpid;
int pipefds[2], rdb_pipe_write = 0, safe_to_exit_pipe = 0;
int rdb_channel = server.repl_rdb_channel && (req & SLAVE_REQ_RDB_CHANNEL);
int slots_req = req & SLAVE_REQ_SLOTS_SNAPSHOT;
if (hasActiveChildProcess()) return C_ERR;
@ -3959,7 +3983,13 @@ int rdbSaveToSlavesSockets(int req, rdbSaveInfo *rsi) {
redisSetProcTitle("redis-rdb-to-slaves");
redisSetCpuAffinity(server.bgsave_cpulist);
retval = rdbSaveRioWithEOFMark(req,&rdb,NULL,rsi);
if (req & SLAVE_REQ_SLOTS_SNAPSHOT) {
/* Slots snapshot is required */
retval = slotSnapshotSaveRio(req, &rdb, NULL);
} else {
retval = rdbSaveRioWithEOFMark(req,&rdb,NULL,rsi);
}
if (retval == C_OK && rioFlush(&rdb) == 0)
retval = C_ERR;
@ -4009,7 +4039,8 @@ int rdbSaveToSlavesSockets(int req, rdbSaveInfo *rsi) {
}
} else {
serverLog(LL_NOTICE, "Background RDB transfer started by pid %ld to %s", (long)childpid,
rdb_channel ? "replica socket" : "parent process pipe");
rdb_channel ? (slots_req ? "slot migration destination socket" : "replica socket") :
"parent process pipe");
server.rdb_save_time_start = time(NULL);
server.rdb_child_type = RDB_CHILD_TYPE_SOCKET;
if (!rdb_channel) {


@ -236,11 +236,12 @@ This flag should not be used directly by the module.
#define REDISMODULE_NOTIFY_NEW (1<<14) /* n, new key notification */
#define REDISMODULE_NOTIFY_OVERWRITTEN (1<<15) /* o, key overwrite notification */
#define REDISMODULE_NOTIFY_TYPE_CHANGED (1<<16) /* c, key type changed notification */
#define REDISMODULE_NOTIFY_KEY_TRIMMED (1<<17) /* module-only keyspace notification, indicates a key trimmed during slot migration */
/* Next notification flag, must be updated when adding new flags above!
This flag should not be used directly by the module.
* Use RedisModule_GetKeyspaceNotificationFlagsAll instead. */
#define _REDISMODULE_NOTIFY_NEXT (1<<17)
#define _REDISMODULE_NOTIFY_NEXT (1<<18)
#define REDISMODULE_NOTIFY_ALL (REDISMODULE_NOTIFY_GENERIC | REDISMODULE_NOTIFY_STRING | REDISMODULE_NOTIFY_LIST | REDISMODULE_NOTIFY_SET | REDISMODULE_NOTIFY_HASH | REDISMODULE_NOTIFY_ZSET | REDISMODULE_NOTIFY_EXPIRED | REDISMODULE_NOTIFY_EVICTED | REDISMODULE_NOTIFY_STREAM | REDISMODULE_NOTIFY_MODULE) /* A */
@ -507,7 +508,9 @@ typedef void (*RedisModuleEventLoopOneShotFunc)(void *user_data);
#define REDISMODULE_EVENT_EVENTLOOP 15
#define REDISMODULE_EVENT_CONFIG 16
#define REDISMODULE_EVENT_KEY 17
#define _REDISMODULE_EVENT_NEXT 18 /* Next event flag, should be updated if a new event added. */
#define REDISMODULE_EVENT_CLUSTER_SLOT_MIGRATION 18
#define REDISMODULE_EVENT_CLUSTER_SLOT_MIGRATION_TRIM 19
#define _REDISMODULE_EVENT_NEXT 20 /* Next event flag, should be updated if a new event added. */
typedef struct RedisModuleEvent {
uint64_t id; /* REDISMODULE_EVENT_... defines. */
@ -618,6 +621,14 @@ static const RedisModuleEvent
RedisModuleEvent_Key = {
REDISMODULE_EVENT_KEY,
1
},
RedisModuleEvent_ClusterSlotMigration = {
REDISMODULE_EVENT_CLUSTER_SLOT_MIGRATION,
1
},
RedisModuleEvent_ClusterSlotMigrationTrim = {
REDISMODULE_EVENT_CLUSTER_SLOT_MIGRATION_TRIM,
1
};
/* Those are values that are used for the 'subevent' callback argument. */
@ -696,6 +707,20 @@ static const RedisModuleEvent
#define _REDISMODULE_SUBEVENT_CRON_LOOP_NEXT 0
#define _REDISMODULE_SUBEVENT_SWAPDB_NEXT 0
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_STARTED 0
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_FAILED 1
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_COMPLETED 2
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_STARTED 3
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_FAILED 4
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_COMPLETED 5
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_MODULE_PROPAGATE 6
#define _REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_NEXT 7
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_STARTED 0
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_COMPLETED 1
#define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_BACKGROUND 2
#define _REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_NEXT 3
/* RedisModuleClientInfo flags. */
#define REDISMODULE_CLIENTINFO_FLAG_SSL (1<<0)
#define REDISMODULE_CLIENTINFO_FLAG_PUBSUB (1<<1)
@ -825,6 +850,41 @@ typedef struct RedisModuleKeyInfo {
#define RedisModuleKeyInfo RedisModuleKeyInfoV1
typedef struct RedisModuleSlotRange {
uint16_t start;
uint16_t end;
} RedisModuleSlotRange;
typedef struct RedisModuleSlotRangeArray {
int32_t num_ranges;
RedisModuleSlotRange ranges[];
} RedisModuleSlotRangeArray;
#define REDISMODULE_CLUSTER_SLOT_MIGRATION_INFO_VERSION 1
typedef struct RedisModuleClusterSlotMigrationInfo {
uint64_t version; /* Not used since this structure is never passed
from the module to the core right now. Here
for future compatibility. */
char source_node_id[REDISMODULE_NODE_ID_LEN + 1];
char destination_node_id[REDISMODULE_NODE_ID_LEN + 1];
const char *task_id;
RedisModuleSlotRangeArray *slots;
} RedisModuleClusterSlotMigrationInfoV1;
#define RedisModuleClusterSlotMigrationInfo RedisModuleClusterSlotMigrationInfoV1
#define REDISMODULE_CLUSTER_SLOT_MIGRATION_TRIMINFO_VERSION 1
typedef struct RedisModuleClusterSlotMigrationTrimInfo {
uint64_t version; /* Not used since this structure is never passed
from the module to the core right now. Here
for future compatibility. */
RedisModuleSlotRangeArray *slots;
} RedisModuleClusterSlotMigrationTrimInfoV1;
#define RedisModuleClusterSlotMigrationTrimInfo RedisModuleClusterSlotMigrationTrimInfoV1
typedef enum {
REDISMODULE_ACL_LOG_AUTH = 0, /* Authentication failure */
REDISMODULE_ACL_LOG_CMD, /* Command authorization failure */
@ -1276,6 +1336,10 @@ REDISMODULE_API void (*RedisModule_SetDisconnectCallback)(RedisModuleBlockedClie
REDISMODULE_API void (*RedisModule_SetClusterFlags)(RedisModuleCtx *ctx, uint64_t flags) REDISMODULE_ATTR;
REDISMODULE_API unsigned int (*RedisModule_ClusterKeySlot)(RedisModuleString *key) REDISMODULE_ATTR;
REDISMODULE_API const char *(*RedisModule_ClusterCanonicalKeyNameInSlot)(unsigned int slot) REDISMODULE_ATTR;
REDISMODULE_API int (*RedisModule_ClusterCanAccessKeysInSlot)(int slot) REDISMODULE_ATTR;
REDISMODULE_API int (*RedisModule_ClusterPropagateForSlotMigration)(RedisModuleCtx *ctx, const char *cmdname, const char *fmt, ...) REDISMODULE_ATTR;
REDISMODULE_API RedisModuleSlotRangeArray *(*RedisModule_ClusterGetLocalSlotRanges)(RedisModuleCtx *ctx) REDISMODULE_ATTR;
REDISMODULE_API void (*RedisModule_ClusterFreeSlotRanges)(RedisModuleCtx *ctx, RedisModuleSlotRangeArray *slots) REDISMODULE_ATTR;
REDISMODULE_API int (*RedisModule_ExportSharedAPI)(RedisModuleCtx *ctx, const char *apiname, void *func) REDISMODULE_ATTR;
REDISMODULE_API void * (*RedisModule_GetSharedAPI)(RedisModuleCtx *ctx, const char *apiname) REDISMODULE_ATTR;
REDISMODULE_API RedisModuleCommandFilter * (*RedisModule_RegisterCommandFilter)(RedisModuleCtx *ctx, RedisModuleCommandFilterFunc cb, int flags) REDISMODULE_ATTR;
@ -1664,6 +1728,10 @@ static int RedisModule_Init(RedisModuleCtx *ctx, const char *name, int ver, int
REDISMODULE_GET_API(SetClusterFlags);
REDISMODULE_GET_API(ClusterKeySlot);
REDISMODULE_GET_API(ClusterCanonicalKeyNameInSlot);
REDISMODULE_GET_API(ClusterCanAccessKeysInSlot);
REDISMODULE_GET_API(ClusterPropagateForSlotMigration);
REDISMODULE_GET_API(ClusterGetLocalSlotRanges);
REDISMODULE_GET_API(ClusterFreeSlotRanges);
REDISMODULE_GET_API(ExportSharedAPI);
REDISMODULE_GET_API(GetSharedAPI);
REDISMODULE_GET_API(RegisterCommandFilter);
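
For illustration, a minimal module that subscribes to the new slot-migration server event via the existing `RedisModule_SubscribeToServerEvent()` API could look roughly like the sketch below. The module name, callback, and log messages are illustrative assumptions, not part of this PR; the `atomicslotmigration.so` test module added further down exercises the same hooks in more depth.

```c
#include "redismodule.h"

/* Illustrative callback: log ASM lifecycle transitions observed on this node. */
static void onSlotMigration(RedisModuleCtx *ctx, RedisModuleEvent e,
                            uint64_t sub, void *data) {
    REDISMODULE_NOT_USED(e);
    RedisModuleClusterSlotMigrationInfo *info = data;
    if (sub == REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_STARTED)
        RedisModule_Log(ctx, "notice", "ASM task %s: migration to %.40s started",
                        info->task_id, info->destination_node_id);
    else if (sub == REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_COMPLETED)
        RedisModule_Log(ctx, "notice", "ASM task %s: import from %.40s completed",
                        info->task_id, info->source_node_id);
}

int RedisModule_OnLoad(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
    REDISMODULE_NOT_USED(argv);
    REDISMODULE_NOT_USED(argc);
    if (RedisModule_Init(ctx, "asmwatch", 1, REDISMODULE_APIVER_1) == REDISMODULE_ERR)
        return REDISMODULE_ERR;
    return RedisModule_SubscribeToServerEvent(ctx, RedisModuleEvent_ClusterSlotMigration,
                                              onSlotMigration);
}
```
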


@ -32,6 +32,7 @@
#include "bio.h"
#include "functions.h"
#include "connection.h"
#include "cluster_asm.h"
#include <memory.h>
#include <sys/time.h>
@ -97,7 +98,7 @@ unsigned long replicationLogicalReplicaCount(void) {
return count;
}
static ConnectionType *connTypeOfReplication(void) {
ConnectionType *connTypeOfReplication(void) {
if (server.tls_replication) {
return connectionTypeTls();
}
@ -246,8 +247,10 @@ void resetReplicationBuffer(void) {
}
int canFeedReplicaReplBuffer(client *replica) {
/* Don't feed replicas that only want the RDB. */
if (replica->flags & CLIENT_REPL_RDBONLY) return 0;
/* Don't feed replicas that only want the RDB, or the main channels of migration
* destinations, which need a filtered stream for the migrating slot ranges. */
if (replica->flags & CLIENT_REPL_RDBONLY ||
replica->flags & CLIENT_ASM_MIGRATING) return 0;
/* Don't feed replicas that are still waiting for BGSAVE to start. */
if (replica->replstate == SLAVE_STATE_WAIT_BGSAVE_START ||
@ -511,6 +514,11 @@ void replicationFeedSlaves(list *slaves, int dictid, robj **argv, int argc) {
* master replication history and has the same backlog and offsets). */
if (server.masterhost != NULL) return;
/* If the current client is marked as master, we will proxy the command stream
* to our slaves instead of replicating it; this also happens during
* atomic slot migration. */
if (server.current_client && server.current_client->flags & CLIENT_MASTER) return;
/* If there aren't slaves, and there is no backlog buffer to populate,
* we can return ASAP. */
if (server.repl_backlog == NULL && listLength(slaves) == 0) {
@ -624,8 +632,8 @@ void showLatestBacklog(void) {
}
/* This function is used in order to proxy what we receive from our master
* to our sub-slaves. */
#include <ctype.h>
* to our sub-slaves. Besides, we also proxy the replication stream from
* the source node during atomic slot migration. */
void replicationFeedStreamFromMasterStream(char *buf, size_t buflen) {
/* There must be replication backlog if having attached slaves. */
if (listLength(server.slaves)) serverAssert(server.repl_backlog != NULL);
@ -634,6 +642,14 @@ void replicationFeedStreamFromMasterStream(char *buf, size_t buflen) {
* replication stream. */
prepareReplicasToWrite();
feedReplicationBuffer(buf,buflen);
} else if (server.masterhost == NULL && server.aof_enabled) {
/* We increment the repl_offset anyway, since we use that for tracking
* AOF fsyncs even when there's no replication active. This code will
* not be reached if AOF is also disabled.
*
* Since we skip feeding the replication buffer during atomic slot migration,
* we need to update the replication offset manually here. */
server.master_repl_offset += 1;
}
}
@ -787,6 +803,21 @@ int replicationSetupSlaveForFullResync(client *slave, long long offset) {
* a SELECT statement in the replication stream. */
server.slaveseldb = -1;
/* Slots snapshot. */
if (slave->flags & CLIENT_REPL_RDB_CHANNEL &&
slave->slave_req & SLAVE_REQ_SLOTS_SNAPSHOT)
{
/* Start delivering the command stream for the migrating slots. */
asmSlotSnapshotAndStreamStart(slave->task);
buflen = snprintf(buf, sizeof(buf), "+SLOTSSNAPSHOT\r\n");
if (connWrite(slave->conn, buf, buflen) != buflen) {
freeClientAsync(slave);
return C_ERR;
}
return C_OK;
}
/* Don't send this reply to slaves that approached us with
* the old SYNC command. */
if (!(slave->flags & CLIENT_PRE_PSYNC)) {
@ -951,8 +982,9 @@ int startBgsaveForReplication(int mincapa, int req) {
/* `SYNC` should have failed with error if we don't support socket and require a filter, assert this here */
serverAssert(socket_target || !(req & SLAVE_REQ_RDB_MASK));
int slots_req = req & SLAVE_REQ_SLOTS_SNAPSHOT;
serverLog(LL_NOTICE,"Starting BGSAVE for SYNC with target: %s%s",
socket_target ? "replicas sockets" : "disk",
socket_target ? (slots_req ? "slot migration destination socket" : "replicas sockets") : "disk",
(req & SLAVE_REQ_RDB_CHANNEL) ? " (rdb-channel)" : "");
rdbSaveInfo rsi, *rsiptr;
@ -1164,6 +1196,11 @@ void syncCommand(client *c) {
/* Create the replication backlog if needed. */
createReplicationBacklogIfNeeded();
/* Keep the client in the main thread to avoid data races between the
* connWrite call in startBgsaveForReplication and the client's event
* handler in IO threads. */
if (c->tid != IOTHREAD_MAIN_THREAD_ID) keepClientInMainThread(c);
/* CASE 1: BGSAVE is in progress, with disk target. */
if (server.child_type == CHILD_TYPE_RDB &&
server.rdb_child_type == RDB_CHILD_TYPE_DISK)
@ -1452,6 +1489,10 @@ int replicaPutOnline(client *slave) {
replicationGetSlaveName(slave));
return 0;
}
/* Don't put migration destination client online. */
if (slave->flags & CLIENT_ASM_MIGRATING) return 0;
slave->replstate = SLAVE_STATE_ONLINE;
slave->repl_ack_time = server.unixtime; /* Prevent false timeout. */
@ -1780,14 +1821,21 @@ void updateSlavesWaitingBgsave(int bgsaveerr, int type) {
if (slave->replstate == SLAVE_STATE_SEND_BULK_AND_STREAM) {
/* This is the main channel of the slave that received the RDB.
* Put it online if RDB delivery is successful. */
if (bgsaveerr == C_OK)
if (bgsaveerr == C_OK) {
/* Notify the task that the snapshot bulk delivery is done */
if (slave->flags & CLIENT_ASM_MIGRATING)
asmSlotSnapshotSucceed(slave->task);
replicaPutOnline(slave);
else
} else {
freeClientAsync(slave);
}
} else if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_END) {
struct redis_stat buf;
if (bgsaveerr != C_OK) {
/* Notify the task that the snapshot bulk delivery failed */
if (slave->flags & CLIENT_ASM_MIGRATING)
asmSlotSnapshotFailed(slave->task);
freeClientAsync(slave);
serverLog(LL_WARNING,"SYNC failed. BGSAVE child returned an error");
continue;
@ -1799,6 +1847,13 @@ void updateSlavesWaitingBgsave(int bgsaveerr, int type) {
* diskless replication, our work is trivial, we can just put
* the slave online. */
if (type == RDB_CHILD_TYPE_SOCKET) {
/* Slots snapshot */
if (slave->slave_req & SLAVE_REQ_SLOTS_SNAPSHOT) {
serverLog(LL_NOTICE, "Streamed slots snapshot transfer succeeded");
freeClientAsync(slave);
continue;
}
serverLog(LL_NOTICE,
"Streamed RDB transfer with replica %s succeeded (socket). Waiting for REPLCONF ACK from replica to enable streaming",
replicationGetSlaveName(slave));
@ -2352,6 +2407,8 @@ void readSyncBulkPayload(connection *conn) {
/* RDB loading succeeded if we reach this point. */
if (server.repl_diskless_load == REPL_DISKLESS_LOAD_SWAPDB) {
/* Cancel all ASM trim jobs as we are about to swap the main db. */
asmCancelTrimJobs();
/* We will soon swap main db with tempDb and replicas will start
* to apply data from new master, we must discard the cached
* master structure and force resync of sub-replicas. */
@ -3719,78 +3776,75 @@ error:
rdbChannelAbort();
}
void replDataBufInit(replDataBuf *buf) {
serverAssert(buf->blocks == NULL);
buf->size = 0;
buf->used = 0;
buf->last_num_blocks = 0;
buf->mem_used = 0;
buf->blocks = listCreate();
buf->blocks->free = zfree;
}
void replDataBufClear(replDataBuf *buf) {
if (buf->blocks) listRelease(buf->blocks);
buf->blocks = NULL;
buf->size = 0;
buf->used = 0;
buf->last_num_blocks = 0;
buf->mem_used = 0;
}
/* Replication: Replica side.
* Initialize replica's local replication buffer to accumulate repl stream
* during rdb channel sync. */
static void rdbChannelReplDataBufInit(void) {
serverAssert(server.repl_full_sync_buffer.blocks == NULL);
server.repl_full_sync_buffer.size = 0;
server.repl_full_sync_buffer.used = 0;
server.repl_full_sync_buffer.last_num_blocks = 0;
server.repl_full_sync_buffer.mem_used = 0;
server.repl_full_sync_buffer.blocks = listCreate();
server.repl_full_sync_buffer.blocks->free = zfree;
replDataBufInit(&server.repl_full_sync_buffer);
}
/* Replication: Replica side.
* Free replica's local replication buffer */
static void rdbChannelReplDataBufFree(void) {
listRelease(server.repl_full_sync_buffer.blocks);
server.repl_full_sync_buffer.blocks = NULL;
server.repl_full_sync_buffer.size = 0;
server.repl_full_sync_buffer.used = 0;
server.repl_full_sync_buffer.last_num_blocks = 0;
server.repl_full_sync_buffer.mem_used = 0;
* Clear replica's local replication buffer */
static void rdbChannelReplDataBufClear(void) {
replDataBufClear(&server.repl_full_sync_buffer);
}
/* Replication: Replica side.
* Reads replication data from master connection into the repl buffer block */
int rdbChannelReadIntoBuf(connection *conn, replDataBufBlock *b) {
/* Generic function to read data from connection into the last block. */
static int replDataBufReadIntoLastBlock(connection *conn, replDataBuf *buf,
void (*error_handler)(connection *conn))
{
atomicIncr(server.stat_io_reads_processed[IOTHREAD_MAIN_THREAD_ID], 1);
int nread = connRead(conn, b->buf + b->used, b->size - b->used);
replDataBufBlock *block = listNodeValue(listLast(buf->blocks));
serverAssert(block && block->size > block->used);
int nread = connRead(conn, block->buf + block->used, block->size - block->used);
if (nread <= 0) {
if (nread == 0 || connGetState(conn) != CONN_STATE_CONNECTED) {
serverLog(LL_WARNING, "Main channel error while reading from master: %s",
connGetLastError(conn));
cancelReplicationHandshake(1);
error_handler(conn);
}
return -1;
}
b->used += nread;
server.repl_full_sync_buffer.used += nread;
block->used += nread;
if (buf) buf->used += nread;
atomicIncr(server.stat_net_repl_input_bytes, nread);
return nread;
}
/* Replication: Replica side.
* Read handler for buffering incoming repl data during RDB download/loading. */
void rdbChannelBufferReplData(connection *conn) {
/* Generic function to read data from connection into a buffer. */
void replDataBufReadFromConn(connection *conn, replDataBuf *buf, void (*error_handler)(connection *conn)) {
const int buflen = 1024 * 1024;
const int minread = 16 * 1024;
int nread = 0;
int needs_read = 1;
listNode *ln = listLast(server.repl_full_sync_buffer.blocks);
listNode *ln = listLast(buf->blocks);
replDataBufBlock *tail = ln ? listNodeValue(ln) : NULL;
if (server.repl_main_ch_state & REPL_MAIN_CH_STREAMING_BUF) {
/* While streaming accumulated buffers, we continue reading from the
* master to prevent accumulation on master side as much as possible.
* However, we aim to drain buffer eventually. To ensure we consume more
* than we read, we'll read at most one block after two blocks of
* buffers are consumed. */
replDataBuf *buf = &server.repl_full_sync_buffer;
if (listLength(buf->blocks) + 1 >= buf->last_num_blocks)
return;
buf->last_num_blocks = listLength(buf->blocks);
}
/* Try to append last node. */
if (tail && tail->size > tail->used) {
nread = rdbChannelReadIntoBuf(conn, tail);
nread = replDataBufReadIntoLastBlock(conn, buf, error_handler);
if (nread <= 0)
return;
@ -3808,11 +3862,18 @@ void rdbChannelBufferReplData(connection *conn) {
* the limit.*/
limit = server.repl_full_sync_buffer_limit;
if (limit == 0)
limit = server.client_obuf_limits[CLIENT_TYPE_SLAVE].hard_limit_bytes;
limit = server.client_obuf_limits[CLIENT_TYPE_SLAVE].hard_limit_bytes;
if (limit != 0 && buf->size > limit) {
/* Currently this function is only used for replication and slots sync.
* Log accordingly; this may need to be made extensible in the future. */
if (server.masterhost)
serverLog(LL_NOTICE, "Replication buffer limit has been reached (%llu bytes), "
"stopped buffering replication stream. Further accumulation may occur on master side.", limit);
else
serverLog(LL_NOTICE, "Slots sync buffer limit has been reached (%llu bytes), "
"stopped buffering slots sync stream. Further accumulation may occur on source side.", limit);
if (limit != 0 && server.repl_full_sync_buffer.size > limit) {
serverLog(LL_NOTICE, "Replication buffer limit has been reached (%llu bytes), "
"stopped buffering replication stream. Further accumulation may occur on master side. ", limit);
connSetReadHandler(conn, NULL);
return;
}
@ -3821,30 +3882,148 @@ void rdbChannelBufferReplData(connection *conn) {
tail->size = usable_size - sizeof(replDataBufBlock);
tail->used = 0;
listAddNodeTail(server.repl_full_sync_buffer.blocks, tail);
server.repl_full_sync_buffer.size += tail->size;
server.repl_full_sync_buffer.mem_used += usable_size + sizeof(listNode);
listAddNodeTail(buf->blocks, tail);
buf->size += tail->size;
buf->mem_used += usable_size + sizeof(listNode);
/* Update buffer's peak */
if (server.repl_full_sync_buffer.peak < server.repl_full_sync_buffer.size)
server.repl_full_sync_buffer.peak = server.repl_full_sync_buffer.size;
if (buf->peak < buf->size)
buf->peak = buf->size;
rdbChannelReadIntoBuf(conn, tail);
replDataBufReadIntoLastBlock(conn, buf, error_handler);
}
}
/* Replication: Replica side.
* Main channel read error handler */
static void readReplBufferErrorHandler(connection *conn) {
serverLog(LL_WARNING, "Main channel error while reading from master: %s",
connGetLastError(conn));
cancelReplicationHandshake(1);
}
/* Replication: Replica side.
* Read handler for buffering incoming repl data during RDB download/loading. */
static void rdbChannelBufferReplData(connection *conn) {
replDataBuf *buf = &server.repl_full_sync_buffer;
if (server.repl_main_ch_state & REPL_MAIN_CH_STREAMING_BUF) {
/* While streaming accumulated buffers, we continue reading from the
* master to prevent accumulation on master side as much as possible.
* However, we aim to drain buffer eventually. To ensure we consume more
* than we read, we'll read at most one block after two blocks of
* buffers are consumed. */
if (listLength(buf->blocks) + 1 >= buf->last_num_blocks)
return;
buf->last_num_blocks = listLength(buf->blocks);
}
replDataBufReadFromConn(conn, buf, readReplBufferErrorHandler);
}
/* Generic function to stream replDataBuf data into the database.
* Returns C_OK on success, C_ERR on error */
int replDataBufStreamToDb(replDataBuf *buf, replDataBufToDbCtx *ctx) {
listNode *n;
int ret = C_OK;
client *c = ctx->client;
blockingOperationStarts();
while ((n = listFirst(buf->blocks))) {
replDataBufBlock *o = listNodeValue(n);
listUnlinkNode(buf->blocks, n);
zfree(n);
size_t processed = 0;
while (processed < o->used) {
size_t bytes = min(PROTO_IOBUF_LEN, o->used - processed);
c->querybuf = sdscatlen(c->querybuf, &o->buf[processed], bytes);
c->read_reploff += (long long int) bytes;
c->lastinteraction = server.unixtime;
/* We don't expect error return value but just in case. */
ret = processInputBuffer(c);
if (ret != C_OK) break;
processed += bytes;
buf->used -= bytes;
if (server.repl_debug_pause & REPL_DEBUG_ON_STREAMING_REPL_BUF)
debugPauseProcess();
/* Check if we should yield back to the event loop */
if (server.loading_process_events_interval_bytes &&
((ctx->applied_offset + bytes) / server.loading_process_events_interval_bytes >
ctx->applied_offset / server.loading_process_events_interval_bytes))
{
ctx->yield_callback(ctx);
processEventsWhileBlocked();
}
ctx->applied_offset += bytes;
/* Check if we should continue processing */
if (!ctx->should_continue(ctx)) {
ret = C_ERR;
break;
}
/* Streaming buffer into the database more slowly is useful in order
* to test certain edge cases. */
if (server.key_load_delay) debugDelay(server.key_load_delay);
}
size_t size = o->size;
zfree(o);
/* Break the loop if there is an error. */
if (ret != C_OK) break;
/* Update stats */
buf->size -= size;
buf->mem_used -= (size + sizeof(listNode) + sizeof(replDataBufBlock));
}
blockingOperationEnds();
return ret;
}
/* Replication: Replica side.
* Yield callback for streaming replDataBuf to database */
static void rdbChannelStreamYieldCallback(void *ctx) {
UNUSED(ctx);
replicationSendNewlineToMaster();
}
/* Replication: Replica side.
* Global variable to track the number of master disconnections.
* Used to detect a master disconnection while streaming replDataBuf to the database */
static uint64_t ReplNumMasterDisconnection = 0;
/* Replication: Replica side.
* Check if we should continue streaming replDataBuf to database */
static int rdbChannelStreamShouldContinue(void *ctx) {
replDataBufToDbCtx *context = ctx;
/* Check if master client was freed in processEventsWhileBlocked().
* It can happen if we receive 'replicaof' command or 'client kill'
* command for the master. */
if (ReplNumMasterDisconnection != server.repl_num_master_disconnection ||
!server.repl_full_sync_buffer.blocks ||
context->client->flags & CLIENT_CLOSE_ASAP)
{
return 0;
}
return 1;
}
/* Replication: Replica side.
* Streams accumulated replication data into the database. */
static void rdbChannelStreamReplDataToDb(void) {
int ret = C_OK, master_disconnected = 0, close_asap = 0;
size_t offset = 0;
listNode *n = NULL;
replDataBufBlock *o = NULL;
int ret = C_OK, close_asap = 0;
client *c = server.master;
/* Save repl_num_master_disconnection to figure out if master gets
* disconnected when we yield back to processEventsWhileBlocked() */
uint64_t seq = server.repl_num_master_disconnection;
ReplNumMasterDisconnection = server.repl_num_master_disconnection;
server.repl_main_ch_state |= REPL_MAIN_CH_STREAMING_BUF;
serverLog(LL_NOTICE, "MASTER <-> REPLICA sync: Starting to stream replication buffer into the db"
@ -3858,63 +4037,14 @@ static void rdbChannelStreamReplDataToDb(void) {
/* Set read handler to continue accumulating during streaming */
connSetReadHandler(c->conn, rdbChannelBufferReplData);
blockingOperationStarts();
while ((n = listFirst(server.repl_full_sync_buffer.blocks))) {
o = listNodeValue(n);
listUnlinkNode(server.repl_full_sync_buffer.blocks, n);
zfree(n);
replDataBufToDbCtx ctx = {
.client = c,
.applied_offset = 0,
.should_continue = rdbChannelStreamShouldContinue,
.yield_callback = rdbChannelStreamYieldCallback,
};
size_t processed = 0;
while (processed < o->used) {
size_t bytes = min(PROTO_IOBUF_LEN, o->used - processed);
c->querybuf = sdscatlen(c->querybuf, &o->buf[processed], bytes);
c->read_reploff += (long long int) bytes;
/* We don't expect error return value but just in case. */
ret = processInputBuffer(c);
if (ret != C_OK)
break;
processed += bytes;
server.repl_full_sync_buffer.used -= bytes;
if (server.repl_debug_pause & REPL_DEBUG_ON_STREAMING_REPL_BUF)
debugPauseProcess();
/* Check if we should yield back to the event loop */
if (server.loading_process_events_interval_bytes &&
((offset + bytes) / server.loading_process_events_interval_bytes >
offset / server.loading_process_events_interval_bytes))
{
replicationSendNewlineToMaster();
processEventsWhileBlocked();
}
offset += bytes;
/* Check if master client was freed in processEventsWhileBlocked().
* It can happen if we receive 'replicaof' command or 'client kill'
* command for the master. */
master_disconnected = (seq != server.repl_num_master_disconnection);
if (master_disconnected ||
!server.repl_full_sync_buffer.blocks ||
c->flags & CLIENT_CLOSE_ASAP)
{
ret = C_ERR;
break;
}
}
size_t size = o->size;
zfree(o);
/* Break the loop if there is an error. */
if (ret != C_OK)
break;
/* Update stats */
server.repl_full_sync_buffer.size -= size;
server.repl_full_sync_buffer.mem_used -= (size + sizeof(listNode) +
sizeof(replDataBufBlock));
}
blockingOperationEnds();
ret = replDataBufStreamToDb(&server.repl_full_sync_buffer, &ctx);
out:
/* If main channel state is CLOSE_ASAP, it means main channel faced a
@ -3925,7 +4055,8 @@ out:
close_asap = (server.repl_main_ch_state & REPL_MAIN_CH_CLOSE_ASAP);
if (ret == C_OK) {
serverLog(LL_NOTICE, "MASTER <-> REPLICA sync: Successfully streamed replication buffer into the db (%zu bytes in total)", offset);
serverLog(LL_NOTICE, "MASTER <-> REPLICA sync: Successfully streamed replication buffer into the db (%zu bytes in total)",
ctx.applied_offset);
/* Revert the read handler */
if (!close_asap && connSetReadHandler(c->conn, readQueryFromClient) != C_OK) {
serverLog(LL_WARNING,
@ -3938,9 +4069,9 @@ out:
close_asap = 1;
}
/* If master_disconnected is set, state should have been cleaned up
/* If master is disconnected, state should have been cleaned up
* already. Otherwise, we do it here. */
if (!master_disconnected) {
if (ReplNumMasterDisconnection == server.repl_num_master_disconnection) {
rdbChannelCleanup();
if (server.master && close_asap)
freeClient(server.master);
@ -3950,7 +4081,7 @@ out:
static void rdbChannelCleanup(void) {
server.repl_rdb_ch_state = REPL_RDB_CH_STATE_NONE;
server.repl_main_ch_state = REPL_MAIN_CH_NONE;
rdbChannelReplDataBufFree();
rdbChannelReplDataBufClear();
}
/* Replication: Replica side.
@ -5051,6 +5182,8 @@ void failoverCommand(client *c) {
server.force_failover = force_flag;
server.failover_state = FAILOVER_WAIT_FOR_SYNC;
/* Cancel all ASM tasks when starting failover */
clusterAsmCancel(NULL, "failover requested");
/* Cluster failover will unpause eventually */
pauseActions(PAUSE_DURING_FAILOVER,
LLONG_MAX,
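
The hunks above also generalize the rdb-channel buffering helpers (`replDataBufInit()`, `replDataBufReadFromConn()`, `replDataBufStreamToDb()`) so that callers other than the main replication path can reuse them through the `replDataBufToDbCtx` callbacks declared in the `server.h` hunk further down. A rough sketch of that callback contract for a hypothetical import-side caller follows; the actual ASM import code is not shown in this excerpt, so the names below are assumptions.

```c
/* Hypothetical consumer of the generalized replDataBuf helpers (sketch only). */
static int importShouldContinue(void *ctx) {
    replDataBufToDbCtx *c = ctx;
    /* Stop applying the stream if the fake master client used for the
     * import has been scheduled to close. */
    return !(c->client->flags & CLIENT_CLOSE_ASAP);
}

static void importYield(void *ctx) {
    UNUSED(ctx);
    /* Called periodically while applying a large buffer, e.g. to keep the
     * connection to the source node alive. */
}

static int importApplyBufferedStream(replDataBuf *buf, client *fake_master) {
    replDataBufToDbCtx ctx = {
        .privdata = NULL,
        .client = fake_master,
        .applied_offset = 0,
        .should_continue = importShouldContinue,
        .yield_callback = importYield,
    };
    /* The buffer is assumed to have been filled with replDataBufReadFromConn(). */
    return replDataBufStreamToDb(buf, &ctx);
}
```
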


@ -480,6 +480,7 @@ static size_t rioConnsetWrite(rio *r, const void *buf, size_t len) {
errno = ETIMEDOUT;
r->io.connset.dst[i].failed = 1;
failed++;
break;
}
n_written += ret;


@ -28,6 +28,7 @@
#include "fmtargs.h"
#include "mstr.h"
#include "ebuckets.h"
#include "cluster_asm.h"
#include "fwtree.h"
#include "estore.h"
@ -1211,6 +1212,10 @@ void databasesCron(void) {
/* Defrag keys gradually. */
activeDefragCycle();
/* Handle active-trim */
if (server.cluster_enabled)
asmActiveTrimCycle();
/* Perform hash tables rehashing if needed, but only if there are no
* other processes saving the DB on disk. Otherwise rehashing is bad
* as will cause a lot of copy-on-write of memory pages. */
@ -2224,6 +2229,7 @@ void initServerConfig(void) {
memset(server.listeners, 0x00, sizeof(server.listeners));
server.active_expire_enabled = 1;
server.allow_access_expired = 0;
server.allow_access_trimmed = 0;
server.skip_checksum_validation = 0;
server.loading = 0;
server.async_loading = 0;
@ -3453,7 +3459,7 @@ static int shouldPropagate(int target) {
return 1;
}
if (target & PROPAGATE_REPL) {
if (server.masterhost == NULL && (server.repl_backlog || listLength(server.slaves) != 0))
if (server.masterhost == NULL && (server.repl_backlog || listLength(server.slaves) != 0 || asmMigrateInProgress()))
return 1;
}
@ -3486,8 +3492,10 @@ static void propagateNow(int dbid, robj **argv, int argc, int target) {
if (server.aof_state != AOF_OFF && target & PROPAGATE_AOF)
feedAppendOnlyFile(dbid,argv,argc);
if (target & PROPAGATE_REPL)
if (target & PROPAGATE_REPL) {
replicationFeedSlaves(server.slaves,dbid,argv,argc);
asmFeedMigrationClient(argv, argc);
}
}
/* Used inside commands to schedule the propagation of additional commands
@ -4326,6 +4334,26 @@ int processCommand(client *c) {
return C_OK;
}
/* If this node is a replica and there is a trim job due to slot migration,
* we cannot process commands from the master for the slot being trimmed.
* Otherwise, the trim cycle could mistakenly delete newly added keys.
* In this case, the master will be blocked until the trim job finishes.
* This is expected to be a rare event, since it requires slots to be migrated
* out and imported back before the trim job is done. */
if ((c->flags & CLIENT_MASTER) && is_write_command && server.cluster_enabled) {
/* Check if the command is accessing keys in a slot being trimmed. */
int slot_in_trim = asmGetTrimmingSlotForCommand(c->cmd, c->argv, c->argc);
if (slot_in_trim != -1) {
serverLog(LL_WARNING, "Master is sending command for slot %d. "
"There is an trim job in progress for this slot. "
"This replica cannot process this command right now. "
"Blocking master client until trim job is done. ", slot_in_trim);
/* Block master client */
blockPostponeClientWithType(c, BLOCKED_POSTPONE_TRIM);
return C_OK;
}
}
/* Only allow a subset of commands in the context of Pub/Sub if the
* connection is in RESP2 mode. With RESP3 there are no limits. */
if ((c->flags & CLIENT_PUBSUB && c->resp == 2) &&
@ -4586,6 +4614,9 @@ int prepareForShutdown(int flags) {
if (server.supervised_mode == SUPERVISED_SYSTEMD)
redisCommunicateSystemd("STOPPING=1\n");
/* Cancel all ASM tasks before shutting down. */
clusterAsmCancel(NULL, "server shutdown");
/* If we have any replicas, let them catch up the replication offset before
* we shut down, to avoid data loss. */
if (!(flags & SHUTDOWN_NOW) &&
@ -4619,6 +4650,8 @@ int isReadyToShutdown(void) {
listRewind(server.slaves, &li);
while ((ln = listNext(&li)) != NULL) {
client *replica = listNodeValue(ln);
/* Don't count migration destination replicas. */
if (replica->flags & CLIENT_ASM_MIGRATING) continue;
if (replica->repl_ack_off != server.master_repl_offset) return 0;
}
return 1;
@ -4665,6 +4698,8 @@ int finishShutdown(void) {
listRewind(server.slaves, &replicas_iter);
while ((replicas_list_node = listNext(&replicas_iter)) != NULL) {
client *replica = listNodeValue(replicas_list_node);
/* Don't count migration destination replicas. */
if (replica->flags & CLIENT_ASM_MIGRATING) continue;
num_replicas++;
if (replica->repl_ack_off != server.master_repl_offset) {
num_lagging_replicas++;
@ -6022,6 +6057,9 @@ sds genRedisInfoString(dict *section_dict, int all_sections, int everything) {
"mem_replica_full_sync_buffer:%zu\r\n", server.repl_full_sync_buffer.mem_used,
"mem_clients_slaves:%zu\r\n", mh->clients_slaves,
"mem_clients_normal:%zu\r\n", mh->clients_normal,
"mem_cluster_slot_migration_output_buffer:%zu\r\n", mh->asm_migrate_output_buffer,
"mem_cluster_slot_migration_input_buffer:%zu\r\n", mh->asm_import_input_buffer,
"mem_cluster_slot_migration_input_buffer_peak:%zu\r\n", asmGetPeakSyncBufferSize(),
"mem_cluster_links:%zu\r\n", mh->cluster_links,
"mem_aof_buffer:%zu\r\n", mh->aof_buffer,
"mem_allocator:%s\r\n", ZMALLOC_LIB,
@ -6342,6 +6380,10 @@ sds genRedisInfoString(dict *section_dict, int all_sections, int everything) {
if (replicationCheckHasMainChannel(slave))
continue;
/* Don't list migration destination replicas. */
if (slave->flags & CLIENT_ASM_MIGRATING)
continue;
if (!slaveip) {
if (connAddrPeerName(slave->conn,ip,sizeof(ip),&port) == -1)
continue;


@ -431,6 +431,8 @@ extern int configOOMScoreAdjValuesDefaults[CONFIG_OOM_COUNT];
#define CLIENT_REEXECUTING_COMMAND (1ULL<<50) /* The client is re-executing the command. */
#define CLIENT_REPL_RDB_CHANNEL (1ULL<<51) /* Client which is used for rdb delivery as part of rdb channel replication */
#define CLIENT_INTERNAL (1ULL<<52) /* Internal client connection */
#define CLIENT_ASM_MIGRATING (1ULL<<53) /* Client is migrating RDB/stream data during atomic slot migration. */
#define CLIENT_ASM_IMPORTING (1ULL<<54) /* Client is importing RDB/stream data during atomic slot migration. */
/* Any flag that does not let optimize FLUSH SYNC to run it in bg as blocking client ASYNC */
#define CLIENT_AVOID_BLOCKING_ASYNC_FLUSH (CLIENT_DENY_BLOCKING|CLIENT_MULTI|CLIENT_LUA_DEBUG|CLIENT_LUA_DEBUG_SYNC|CLIENT_MODULE)
@ -473,6 +475,7 @@ typedef enum blocking_type {
BLOCKED_STREAM, /* XREAD. */
BLOCKED_ZSET, /* BZPOP et al. */
BLOCKED_POSTPONE, /* Blocked by processCommand, re-try processing later. */
BLOCKED_POSTPONE_TRIM, /* Master client is blocked due to an active trim job. */
BLOCKED_SHUTDOWN, /* SHUTDOWN. */
BLOCKED_LAZYFREE, /* LAZYFREE */
BLOCKED_NUM, /* Number of blocked states. */
@ -569,9 +572,10 @@ typedef enum {
#define SLAVE_REQ_NONE 0
#define SLAVE_REQ_RDB_EXCLUDE_DATA (1 << 0) /* Exclude data from RDB */
#define SLAVE_REQ_RDB_EXCLUDE_FUNCTIONS (1 << 1) /* Exclude functions from RDB */
#define SLAVE_REQ_RDB_CHANNEL (1 << 2) /* Use rdb channel replication, transfer RDB background */
#define SLAVE_REQ_SLOTS_SNAPSHOT (1 << 2) /* Only slots snapshot is required */
#define SLAVE_REQ_RDB_CHANNEL (1 << 3) /* Use rdb channel replication, transfer RDB background */
/* Mask of all bits in the slave requirements bitfield that represent non-standard (filtered) RDB requirements */
#define SLAVE_REQ_RDB_MASK (SLAVE_REQ_RDB_EXCLUDE_DATA | SLAVE_REQ_RDB_EXCLUDE_FUNCTIONS)
#define SLAVE_REQ_RDB_MASK (SLAVE_REQ_RDB_EXCLUDE_DATA | SLAVE_REQ_RDB_EXCLUDE_FUNCTIONS | SLAVE_REQ_SLOTS_SNAPSHOT)
/* Synchronous read timeout - slave side */
#define CONFIG_REPL_SYNCIO_TIMEOUT 5
@ -719,6 +723,7 @@ typedef enum {
PAUSE_BY_CLIENT_COMMAND = 0,
PAUSE_DURING_SHUTDOWN,
PAUSE_DURING_FAILOVER,
PAUSE_DURING_SLOT_HANDOFF,
NUM_PAUSE_PURPOSES /* This value is the number of purposes above. */
} pause_purpose;
@ -758,6 +763,7 @@ typedef enum {
#define NOTIFY_NEW (1<<14) /* n, new key notification (Note: excluded from NOTIFY_ALL) */
#define NOTIFY_OVERWRITTEN (1<<15) /* o, key overwrite notification (Note: excluded from NOTIFY_ALL) */
#define NOTIFY_TYPE_CHANGED (1<<16) /* c, key type changed notification (Note: excluded from NOTIFY_ALL) */
#define NOTIFY_KEY_TRIMMED (1<<17) /* module-only keyspace notification, indicates a key trimmed during slot migration */
#define NOTIFY_ALL (NOTIFY_GENERIC | NOTIFY_STRING | NOTIFY_LIST | NOTIFY_SET | NOTIFY_HASH | NOTIFY_ZSET | NOTIFY_EXPIRED | NOTIFY_EVICTED | NOTIFY_STREAM | NOTIFY_MODULE) /* A flag */
/* Using the following macro you can run code inside serverCron() with the
@ -840,6 +846,7 @@ struct RedisModuleKeyOptCtx;
struct RedisModuleCommand;
struct clusterState;
struct clusterSlotStat;
struct slotRangeArray;
/* Each module type implementation should export a set of methods in order
* to serialize and deserialize the value in the RDB file, rewrite the AOF
@ -1469,6 +1476,8 @@ typedef struct client {
unsigned long long net_input_bytes; /* Total network input bytes read from this client. */
unsigned long long net_output_bytes; /* Total network output bytes sent to this client. */
unsigned long long commands_processed; /* Total count of commands this client executed. */
struct asmTask *task; /* Atomic slot migration task */
char *node_id; /* Node ID to connect to for atomic slot migration */
} client;
typedef struct __attribute__((aligned(CACHE_LINE_SIZE))) {
@ -1486,6 +1495,15 @@ typedef struct __attribute__((aligned(CACHE_LINE_SIZE))) {
list *clients; /* IO thread managed clients. */
} IOThread;
/* Context for streaming replDataBuf to database */
typedef struct replDataBufToDbCtx {
void *privdata; /* Private data of context */
client *client; /* Client to process commands */
size_t applied_offset; /* Offset applied to the database */
int (*should_continue)(void *ctx); /* Check if we should continue */
void (*yield_callback)(void *ctx); /* Yield to the event loop */
} replDataBufToDbCtx;
/* ACL information */
typedef struct aclInfo {
long long user_auth_failures; /* Auth failure counts on user level */
@ -1630,6 +1648,8 @@ struct redisMemOverhead {
size_t overhead_db_hashtable_lut;
size_t overhead_db_hashtable_rehashing;
unsigned long db_dict_rehashing_count;
size_t asm_import_input_buffer;
size_t asm_migrate_output_buffer;
struct {
size_t dbid;
size_t overhead_ht_main;
@ -1959,6 +1979,7 @@ struct redisServer {
int active_expire_enabled; /* Can be disabled for testing purposes. */
int active_expire_effort; /* From 1 (default) to 10, active effort. */
int allow_access_expired; /* If > 0, allow access to logically expired keys */
int allow_access_trimmed; /* If > 0, allow access to logically trimmed keys */
int active_defrag_enabled;
int sanitize_dump_payload; /* Enables deep sanitization for ziplist and listpack in RDB and RESTORE. */
int skip_checksum_validation; /* Disable checksum validation for RDB and RESTORE payload. */
@ -2237,6 +2258,10 @@ struct redisServer {
mstime_t cluster_node_timeout; /* Cluster node timeout. */
mstime_t cluster_ping_interval; /* A debug configuration for setting how often cluster nodes send ping messages. */
char *cluster_configfile; /* Cluster auto-generated config file name. */
long long asm_handoff_max_lag_bytes; /* Maximum lag in bytes before pausing writes for ASM handoff. */
long long asm_write_pause_timeout; /* Timeout in milliseconds to pause writes during ASM handoff. */
long long asm_sync_buffer_drain_timeout; /* Timeout in milliseconds for sync buffer to drain during ASM. */
int asm_max_archived_tasks; /* Maximum number of archived ASM tasks to keep in memory. */
struct clusterState *cluster; /* State of the cluster */
struct clusterSlotStat *cluster_slot_stats; /* Struct used for storing slot statistics, for all slots owned by the current shard. */
int cluster_migration_barrier; /* Cluster replicas migration barrier. */
@ -2799,6 +2824,7 @@ void moduleDefragStart(void);
void moduleDefragEnd(void);
void *moduleGetHandleByName(char *modulename);
int moduleIsModuleCommand(void *module_handle, struct redisCommand *cmd);
int moduleHasSubscribersForKeyspaceEvent(int type);
/* Utils */
long long ustime(void);
@ -2899,6 +2925,7 @@ void rewriteClientCommandArgument(client *c, int i, robj *newval);
void replaceClientCommandVector(client *c, int argc, robj **argv);
void redactClientCommandArgument(client *c, int argc);
size_t getClientOutputBufferMemoryUsage(client *c);
size_t getNormalClientPendingReplyBytes(client *c);
size_t getClientMemoryUsage(client *c, size_t *output_buffer_mem_usage);
int freeClientsInAsyncFreeQueue(void);
int closeClientOnOutputBufferLimitReached(client *c, int async);
@ -2952,6 +2979,7 @@ void unbindClientFromIOThreadEventLoop(client *c);
int processClientsOfAllIOThreads(void);
int processClientsFromMainThread(IOThread *t);
void assignClientToIOThread(client *c);
void keepClientInMainThread(client *c);
void fetchClientFromIOThread(client *c);
int isClientMustHandledByMainThread(client *c);
@ -3028,7 +3056,7 @@ void queueMultiCommand(client *c, uint64_t cmd_flags);
size_t multiStateMemOverhead(client *c);
void touchWatchedKey(redisDb *db, robj *key);
int isWatchedKeyExpired(client *c);
void touchAllWatchedKeysInDb(redisDb *emptied, redisDb *replaced_with);
void touchAllWatchedKeysInDb(redisDb *emptied, redisDb *replaced_with, struct slotRangeArray *slots);
void discardTransaction(client *c);
void flagTransaction(client *c);
void execCommandAbort(client *c, sds error);
@ -3145,6 +3173,10 @@ void abortFailover(const char *err);
const char *getFailoverStateString(void);
int replicationCheckHasMainChannel(client *slave);
unsigned long replicationLogicalReplicaCount(void);
void replDataBufInit(replDataBuf *buf);
void replDataBufClear(replDataBuf *buf);
void replDataBufReadFromConn(connection *conn, replDataBuf *buf, void (*error_handler)(connection *conn));
int replDataBufStreamToDb(replDataBuf *buf, replDataBufToDbCtx *ctx);
/* Generic persistence functions */
void startLoadingFile(size_t size, char* filename, int rdbflags);
@ -3189,6 +3221,7 @@ int aofDelHistoryFiles(void);
int aofRewriteLimited(void);
void updateCurIncrAofEndOffset(void);
void updateReplOffsetAndResetEndOffset(void);
int rewriteObject(rio *r, robj *key, robj *o, int dbid, long long expiretime);
/* Child info */
void openChildInfoPipe(void);
@ -3361,6 +3394,7 @@ int incrCommandStatsOnError(struct redisCommand *cmd, int flags);
void call(client *c, int flags);
void alsoPropagate(int dbid, robj **argv, int argc, int target);
void postExecutionUnitOperations(void);
int redisOpArrayAppend(redisOpArray *oa, int dbid, robj **argv, int argc, int target);
void redisOpArrayFree(redisOpArray *oa);
void forceCommandPropagation(client *c, int flags);
void preventCommandPropagation(client *c);
@ -3699,15 +3733,8 @@ kvobj *dbUnshareStringValueByLink(redisDb *db, robj *key, kvobj *kv, dictEntryLi
#define FLUSH_TYPE_ALL 0
#define FLUSH_TYPE_DB 1
#define FLUSH_TYPE_SLOTS 2
typedef struct SlotRange {
unsigned short first, last;
} SlotRange;
typedef struct SlotsFlush {
int numRanges;
SlotRange ranges[];
} SlotsFlush;
void replySlotsFlushAndFree(client *c, SlotsFlush *sflush);
int flushCommandCommon(client *c, int type, int flags, SlotsFlush *sflush);
void replySlotsFlushAndFree(client *c, struct slotRangeArray *slots);
int flushCommandCommon(client *c, int type, int flags, struct slotRangeArray *ranges);
#define EMPTYDB_NO_FLAGS 0 /* No flags. */
#define EMPTYDB_ASYNC (1<<0) /* Reclaim memory in another thread. */
#define EMPTYDB_NOFUNCTIONS (1<<1) /* Indicate not to flush the functions. */
@ -3721,11 +3748,12 @@ void discardTempDb(redisDb *tempDb);
int selectDb(client *c, int id);
void signalModifiedKey(client *c, redisDb *db, robj *key);
void signalFlushedDb(int dbid, int async);
void signalFlushedDb(int dbid, int async, struct slotRangeArray *slots);
void scanGenericCommand(client *c, robj *o, unsigned long long cursor);
int parseScanCursorOrReply(client *c, robj *o, unsigned long long *cursor);
int dbAsyncDelete(redisDb *db, robj *key);
void emptyDbAsync(redisDb *db);
void emptyDbDataAsync(kvstore *keys, kvstore *expires, ebuckets hexpires);
size_t lazyfreeGetPendingObjectsCount(void);
size_t lazyfreeGetFreedObjectsCount(void);
void lazyfreeResetStats(void);
@ -3740,6 +3768,9 @@ void freeReplicationBacklogRefMemAsync(list *blocks, rax *index);
int getKeysFromCommandWithSpecs(struct redisCommand *cmd, robj **argv, int argc, int search_flags, getKeysResult *result);
keyReference *getKeysPrepareResult(getKeysResult *result, int numkeys);
int getKeysFromCommand(struct redisCommand *cmd, robj **argv, int argc, getKeysResult *result);
#define GETSLOT_NOKEYS (-1)
#define GETSLOT_CROSSSLOT (-2)
int getSlotFromCommand(struct redisCommand *cmd, robj **argv, int argc);
int doesCommandHaveKeys(struct redisCommand *cmd);
int getChannelsFromCommand(struct redisCommand *cmd, robj **argv, int argc, getKeysResult *result);
@ -3829,11 +3860,12 @@ void signalKeyAsReady(redisDb *db, robj *key, int type);
void blockForKeys(client *c, int btype, robj **keys, int numkeys, mstime_t timeout, int unblock_on_nokey);
void blockClientShutdown(client *c);
void blockPostponeClient(client *c);
void blockPostponeClientWithType(client *c, int btype);
void blockForReplication(client *c, mstime_t timeout, long long offset, long numreplicas);
void blockForAofFsync(client *c, mstime_t timeout, long long offset, int numlocal, long numreplicas);
void signalDeletedKeyAsReady(redisDb *db, robj *key, int type);
void updateStatsOnUnblock(client *c, long blocked_us, long reply_us, int had_errors);
void scanDatabaseForDeletedKeys(redisDb *emptied, redisDb *replaced_with);
void scanDatabaseForDeletedKeys(redisDb *emptied, redisDb *replaced_with, struct slotRangeArray *slots);
void totalNumberOfStatefulKeys(unsigned long *blocking_keys, unsigned long *bloking_keys_on_nokey, unsigned long *watched_keys);
void blockedBeforeSleep(void);
@ -3972,6 +4004,7 @@ void sscanCommand(client *c);
void syncCommand(client *c);
void flushdbCommand(client *c);
void flushallCommand(client *c);
void trimslotsCommand(client *c);
void sortCommand(client *c);
void sortroCommand(client *c);
void lremCommand(client *c);
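
Among the declarations added above, `getSlotFromCommand()` with the `GETSLOT_NOKEYS` and `GETSLOT_CROSSSLOT` sentinels resolves a command's keys to a single hash slot. A small sketch of how a caller might interpret the return value; the exact semantics are inferred from the names, so treat this as an assumption rather than documented behavior:

```c
int slot = getSlotFromCommand(c->cmd, c->argv, c->argc);
if (slot == GETSLOT_NOKEYS) {
    /* The command carries no keys, so no slot restriction applies. */
} else if (slot == GETSLOT_CROSSSLOT) {
    /* The keys hash to different slots; reject or handle specially. */
} else {
    /* All keys hash to 'slot' (0..16383), e.g. check its trim/migration state. */
}
```
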


@ -10,6 +10,7 @@
#include "server.h"
#include "redisassert.h"
#include "ebuckets.h"
#include "cluster_asm.h"
#include <math.h>
/* Threshold for HEXPIRE and HPERSIST to be considered whether it is worth to
@ -745,13 +746,18 @@ GetFieldRes hashTypeGetValue(redisDb *db, kvobj *o, sds field, unsigned char **v
(hfeFlags & HFE_LAZY_ACCESS_EXPIRED))
return GETF_OK;
if (server.masterhost) {
/* If CLIENT_MASTER, assume valid as long as it didn't get delete */
if (server.masterhost || server.cluster_enabled) {
/* If CLIENT_MASTER, assume valid as long as it didn't get deleted.
*
* In cluster mode, we also assume valid if we are importing data
* from the source, to avoid deleting fields that are still in use.
* We create a fake master client for data import, which can be
* identified using the CLIENT_MASTER flag. */
if (server.current_client && (server.current_client->flags & CLIENT_MASTER))
return GETF_OK;
/* If user client, then act as if expired, but don't delete! */
return GETF_EXPIRED;
/* For replica, if user client, then act as if expired, but don't delete! */
if (server.masterhost) return GETF_EXPIRED;
}
if ((server.loading) ||
@ -1866,6 +1872,10 @@ uint64_t hashTypeActiveExpire(redisDb *db, kvobj *o, uint32_t *quota, int update
}
/* Delete all expired fields in hash if needed (Currently used only by HRANDFIELD)
*
* NOTICE: If we call this function in other places, we should consider the slot
* migration scenario, where we don't want to delete expired fields. See also
* expireIfNeeded().
*
* Return 1 if the entire hash was deleted, 0 otherwise.
* This function might be pricy in case there are many expired fields.


@ -209,16 +209,34 @@ proc cluster_write_test {id} {
$cluster close
}
# Normalize cluster slots configuration by sorting replicas by node ID
proc normalize_cluster_slots {slots_config} {
set normalized {}
foreach slot_range $slots_config {
if {[llength $slot_range] <= 3} {
lappend normalized $slot_range
} else {
# Sort replicas (index 3+) by node ID, keep start/end/master unchanged
set replicas [lrange $slot_range 3 end]
set sorted_replicas [lsort -index 2 $replicas]
lappend normalized [concat [lrange $slot_range 0 2] $sorted_replicas]
}
}
return $normalized
}
# Check if cluster configuration is consistent.
proc cluster_config_consistent {} {
for {set j 0} {$j < $::cluster_master_nodes + $::cluster_replica_nodes} {incr j} {
if {$j == 0} {
set base_cfg [R $j cluster slots]
set base_secret [R $j debug internal_secret]
set normalized_base_cfg [normalize_cluster_slots $base_cfg]
} else {
set cfg [R $j cluster slots]
set secret [R $j debug internal_secret]
if {$cfg != $base_cfg || $secret != $base_secret} {
set normalized_cfg [normalize_cluster_slots $cfg]
if {$normalized_cfg != $normalized_base_cfg || $secret != $base_secret} {
return 0
}
}


@ -83,7 +83,8 @@ TEST_MODULES = \
rdbloadsave.so \
crash.so \
internalsecret.so \
configaccess.so
configaccess.so \
atomicslotmigration.so
.PHONY: all


@ -0,0 +1,523 @@
#include "redismodule.h"
#include <stdlib.h>
#include <memory.h>
#include <errno.h>
#define MAX_EVENTS 1024
/* Log of cluster events. */
const char *clusterEventLog[MAX_EVENTS];
int numClusterEvents = 0;
/* Log of cluster trim events. */
const char *clusterTrimEventLog[MAX_EVENTS];
int numClusterTrimEvents = 0;
/* Log of last deleted key event. */
const char *lastDeletedKeyLog = NULL;
int replicateModuleCommand = 0; /* Enable or disable module command replication. */
RedisModuleString *moduleCommandKeyName = NULL; /* Key name to replicate. */
RedisModuleString *moduleCommandKeyVal = NULL; /* Key value to replicate. */
/* Enable or disable module command replication. */
int replicate_module_command(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
if (argc != 4) {
RedisModule_ReplyWithError(ctx, "ERR wrong number of arguments");
return REDISMODULE_OK;
}
long long enable = 0;
if (RedisModule_StringToLongLong(argv[1], &enable) != REDISMODULE_OK) {
RedisModule_ReplyWithError(ctx, "ERR enable value");
return REDISMODULE_OK;
}
replicateModuleCommand = (enable != 0);
/* Set the key name and value to replicate. */
if (moduleCommandKeyName) RedisModule_FreeString(ctx, moduleCommandKeyName);
if (moduleCommandKeyVal) RedisModule_FreeString(ctx, moduleCommandKeyVal);
moduleCommandKeyName = RedisModule_CreateStringFromString(ctx, argv[2]);
moduleCommandKeyVal = RedisModule_CreateStringFromString(ctx, argv[3]);
RedisModule_ReplyWithSimpleString(ctx, "OK");
return REDISMODULE_OK;
}
int lpush_and_replicate_crossslot_command(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
if (argc != 3) return RedisModule_WrongArity(ctx);
/* LPUSH */
RedisModuleCallReply *rep = RedisModule_Call(ctx, "LPUSH", "!ss", argv[1], argv[2]);
RedisModule_Assert(RedisModule_CallReplyType(rep) != REDISMODULE_REPLY_ERROR);
RedisModule_FreeCallReply(rep);
/* Replicate cross slot command */
int ret = RedisModule_Replicate(ctx, "MSET", "cccccc", "key1", "val1", "key2", "val2", "key3", "val3");
RedisModule_Assert(ret == REDISMODULE_OK);
RedisModule_ReplyWithSimpleString(ctx, "OK");
return REDISMODULE_OK;
}
int testClusterGetLocalSlotRanges(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
REDISMODULE_NOT_USED(argv);
REDISMODULE_NOT_USED(argc);
static int use_auto_memory = 0;
use_auto_memory = !use_auto_memory;
RedisModuleSlotRangeArray *slots;
if (use_auto_memory) {
RedisModule_AutoMemory(ctx);
slots = RedisModule_ClusterGetLocalSlotRanges(ctx);
} else {
slots = RedisModule_ClusterGetLocalSlotRanges(NULL);
}
RedisModule_ReplyWithArray(ctx, slots->num_ranges);
for (int i = 0; i < slots->num_ranges; i++) {
RedisModule_ReplyWithArray(ctx, 2);
RedisModule_ReplyWithLongLong(ctx, slots->ranges[i].start);
RedisModule_ReplyWithLongLong(ctx, slots->ranges[i].end);
}
if (!use_auto_memory)
RedisModule_ClusterFreeSlotRanges(NULL, slots);
return REDISMODULE_OK;
}
/* Helper function to check if a slot range array contains a given slot. */
int slotRangeArrayContains(RedisModuleSlotRangeArray *sra, unsigned int slot) {
for (int i = 0; i < sra->num_ranges; i++)
if (sra->ranges[i].start <= slot && sra->ranges[i].end >= slot)
return 1;
return 0;
}
/* Sanity check. */
int sanity(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
REDISMODULE_NOT_USED(argv);
REDISMODULE_NOT_USED(argc);
RedisModule_Assert(RedisModule_ClusterCanAccessKeysInSlot(-1) == 0);
RedisModule_Assert(RedisModule_ClusterCanAccessKeysInSlot(16384) == 0);
RedisModule_Assert(RedisModule_ClusterCanAccessKeysInSlot(100000) == 0);
/* Call with invalid args. */
errno = 0;
RedisModule_Assert(RedisModule_ClusterPropagateForSlotMigration(NULL, NULL, NULL) == REDISMODULE_ERR);
RedisModule_Assert(errno == EINVAL);
/* Call with invalid args. */
errno = 0;
RedisModule_Assert(RedisModule_ClusterPropagateForSlotMigration(ctx, NULL, NULL) == REDISMODULE_ERR);
RedisModule_Assert(errno == EINVAL);
/* Call with invalid args. */
errno = 0;
RedisModule_Assert(RedisModule_ClusterPropagateForSlotMigration(NULL, "asm.keyless_cmd", "") == REDISMODULE_ERR);
RedisModule_Assert(errno == EINVAL);
/* Call outside of slot migration. */
errno = 0;
RedisModule_Assert(RedisModule_ClusterPropagateForSlotMigration(ctx, "asm.keyless_cmd", "") == REDISMODULE_ERR);
RedisModule_Assert(errno == EBADF);
RedisModule_ReplyWithSimpleString(ctx, "OK");
return REDISMODULE_OK;
}
/* Command to test RM_ClusterCanAccessKeysInSlot(). */
int testClusterCanAccessKeysInSlot(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
REDISMODULE_NOT_USED(argc);
long long slot = 0;
if (RedisModule_StringToLongLong(argv[1],&slot) != REDISMODULE_OK) {
return RedisModule_ReplyWithError(ctx,"ERR invalid slot");
}
RedisModule_ReplyWithLongLong(ctx, RedisModule_ClusterCanAccessKeysInSlot(slot));
return REDISMODULE_OK;
}
/* Generate a string representation of the info struct and subevent.
e.g. 'sub: cluster-slot-migration-import-started, task_id: aeBd..., slots: 0-100,200-300' */
const char *clusterAsmInfoToString(RedisModuleClusterSlotMigrationInfo *info, uint64_t sub) {
char buf[1024] = {0};
if (sub == REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_STARTED)
snprintf(buf, sizeof(buf), "sub: cluster-slot-migration-import-started, ");
else if (sub == REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_FAILED)
snprintf(buf, sizeof(buf), "sub: cluster-slot-migration-import-failed, ");
else if (sub == REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_COMPLETED)
snprintf(buf, sizeof(buf), "sub: cluster-slot-migration-import-completed, ");
else if (sub == REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_STARTED)
snprintf(buf, sizeof(buf), "sub: cluster-slot-migration-migrate-started, ");
else if (sub == REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_FAILED)
snprintf(buf, sizeof(buf), "sub: cluster-slot-migration-migrate-failed, ");
else if (sub == REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_COMPLETED)
snprintf(buf, sizeof(buf), "sub: cluster-slot-migration-migrate-completed, ");
else {
RedisModule_Assert(0);
}
snprintf(buf + strlen(buf), sizeof(buf) - strlen(buf), "source_node_id:%.40s, destination_node_id:%.40s, ",
info->source_node_id, info->destination_node_id);
snprintf(buf + strlen(buf), sizeof(buf) - strlen(buf), "task_id:%s, slots:", info->task_id);
for (int i = 0; i < info->slots->num_ranges; i++) {
RedisModuleSlotRange *sr = &info->slots->ranges[i];
snprintf(buf + strlen(buf), sizeof(buf) - strlen(buf), "%d-%d", sr->start, sr->end);
if (i != info->slots->num_ranges - 1)
snprintf(buf + strlen(buf), sizeof(buf) - strlen(buf), ",");
}
return RedisModule_Strdup(buf);
}
/* Generate a string representation of the info struct and subevent.
e.g. 'sub: cluster-slot-migration-trim-started, slots:0-100,200-300' */
const char *clusterTrimInfoToString(RedisModuleClusterSlotMigrationTrimInfo *info, uint64_t sub) {
RedisModule_Assert(info);
char buf[1024] = {0};
if (sub == REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_BACKGROUND)
snprintf(buf, sizeof(buf), "sub: cluster-slot-migration-trim-background, ");
else if (sub == REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_STARTED)
snprintf(buf, sizeof(buf), "sub: cluster-slot-migration-trim-started, ");
else if (sub == REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_COMPLETED)
snprintf(buf, sizeof(buf), "sub: cluster-slot-migration-trim-completed, ");
else {
RedisModule_Assert(0);
}
snprintf(buf + strlen(buf), sizeof(buf) - strlen(buf), "slots:");
for (int i = 0; i < info->slots->num_ranges; i++) {
RedisModuleSlotRange *sr = &info->slots->ranges[i];
snprintf(buf + strlen(buf), sizeof(buf) - strlen(buf), "%d-%d", sr->start, sr->end);
if (i != info->slots->num_ranges - 1)
snprintf(buf + strlen(buf), sizeof(buf) - strlen(buf), ",");
}
return RedisModule_Strdup(buf);
}
static void testReplicatingOutsideSlotRange(RedisModuleCtx *ctx, RedisModuleClusterSlotMigrationInfo *info) {
int slot = 0;
while (slot >= 0 && slot <= 16383) {
if (!slotRangeArrayContains(info->slots, slot)) {
break;
}
slot++;
}
char buf[128] = {0};
const char *prefix = RedisModule_ClusterCanonicalKeyNameInSlot(slot);
snprintf(buf, sizeof(buf), "{%s}%s", prefix, "modulekey");
errno = 0;
int ret = RedisModule_ClusterPropagateForSlotMigration(ctx, "SET", "cc", buf, "value");
RedisModule_Assert(ret == REDISMODULE_ERR);
RedisModule_Assert(errno == ERANGE);
}
static void testReplicatingCrossslotCommand(RedisModuleCtx *ctx) {
errno = 0;
int ret = RedisModule_ClusterPropagateForSlotMigration(ctx, "MSET", "cccccc", "key1", "val1", "key2", "val2", "key3", "val3");
RedisModule_Assert(ret == REDISMODULE_ERR);
RedisModule_Assert(errno == ENOTSUP);
}
static void testReplicatingUnknownCommand(RedisModuleCtx *ctx) {
errno = 0;
int ret = RedisModule_ClusterPropagateForSlotMigration(ctx, "unknowncommand", "");
RedisModule_Assert(ret == REDISMODULE_ERR);
RedisModule_Assert(errno == ENOENT);
}
static void testNonFatalScenarios(RedisModuleCtx *ctx, RedisModuleClusterSlotMigrationInfo *info) {
testReplicatingOutsideSlotRange(ctx, info);
testReplicatingCrossslotCommand(ctx);
testReplicatingUnknownCommand(ctx);
}
/* Server event callback for cluster slot migration events: on
 * MIGRATE_MODULE_PROPAGATE it propagates module commands for the migration;
 * all other subevents are appended to the event log. */
void clusterEventCallback(RedisModuleCtx *ctx, RedisModuleEvent e, uint64_t sub, void *data) {
REDISMODULE_NOT_USED(ctx);
int ret;
RedisModule_Assert(RedisModule_IsSubEventSupported(e, sub));
if (e.id == REDISMODULE_EVENT_CLUSTER_SLOT_MIGRATION) {
RedisModuleClusterSlotMigrationInfo *info = data;
if (sub == REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_MODULE_PROPAGATE) {
/* Test some non-fatal scenarios. */
testNonFatalScenarios(ctx, info);
if (replicateModuleCommand == 0) return;
/* Replicate a keyless command. */
ret = RedisModule_ClusterPropagateForSlotMigration(ctx, "asm.keyless_cmd", "");
RedisModule_Assert(ret == REDISMODULE_OK);
/* Propagate configured key and value. */
ret = RedisModule_ClusterPropagateForSlotMigration(ctx, "SET", "ss", moduleCommandKeyName, moduleCommandKeyVal);
RedisModule_Assert(ret == REDISMODULE_OK);
} else {
/* Log the event. */
if (numClusterEvents >= MAX_EVENTS) return;
clusterEventLog[numClusterEvents++] = clusterAsmInfoToString(info, sub);
}
}
}
/* Server event callback for slot migration trim events; each subevent is
 * appended to the trim event log. */
void clusterTrimEventCallback(RedisModuleCtx *ctx, RedisModuleEvent e, uint64_t sub, void *data) {
REDISMODULE_NOT_USED(ctx);
RedisModule_Assert(RedisModule_IsSubEventSupported(e, sub));
if (e.id == REDISMODULE_EVENT_CLUSTER_SLOT_MIGRATION_TRIM) {
/* Log the event. */
if (numClusterTrimEvents >= MAX_EVENTS) return;
RedisModuleClusterSlotMigrationTrimInfo *info = data;
clusterTrimEventLog[numClusterTrimEvents++] = clusterTrimInfoToString(info, sub);
}
}
/* Keyspace notification callback for 'key_trimmed' events; logs each trimmed
 * key to the trim event log. */
static int keyspaceNotificationTrimmedCallback(RedisModuleCtx *ctx, int type, const char *event, RedisModuleString *key) {
REDISMODULE_NOT_USED(ctx);
RedisModule_Assert(type == REDISMODULE_NOTIFY_KEY_TRIMMED);
RedisModule_Assert(strcmp(event, "key_trimmed") == 0);
if (numClusterTrimEvents >= MAX_EVENTS) return REDISMODULE_OK;
/* Log the trimmed key event. */
size_t len;
const char *key_str = RedisModule_StringPtrLen(key, &len);
char buf[1024] = {0};
snprintf(buf, sizeof(buf), "keyspace: key_trimmed, key: %s", key_str);
clusterTrimEventLog[numClusterTrimEvents++] = RedisModule_Strdup(buf);
return REDISMODULE_OK;
}
/* ASM.PARENT SET key value (a thin proxy to the Redis SET command). */
static int asmParentSet(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
if (argc != 4) return RedisModule_WrongArity(ctx);
RedisModuleCallReply *reply = RedisModule_Call(ctx, "SET", "ss", argv[2], argv[3]);
if (!reply) return RedisModule_ReplyWithError(ctx, "ERR internal");
RedisModule_ReplyWithCallReply(ctx, reply);
RedisModule_FreeCallReply(reply);
RedisModule_ReplicateVerbatim(ctx);
return REDISMODULE_OK;
}
/* Clear both the cluster and trim event logs. */
int clearEventLog(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
REDISMODULE_NOT_USED(argv);
REDISMODULE_NOT_USED(argc);
for (int i = 0; i < numClusterEvents; i++)
RedisModule_Free((void *)clusterEventLog[i]);
numClusterEvents = 0;
for (int i = 0; i < numClusterTrimEvents; i++)
RedisModule_Free((void *)clusterTrimEventLog[i]);
numClusterTrimEvents = 0;
RedisModule_ReplyWithSimpleString(ctx, "OK");
return REDISMODULE_OK;
}
/* Reply with the cluster event log. */
int getClusterEventLog(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
REDISMODULE_NOT_USED(ctx);
REDISMODULE_NOT_USED(argv);
REDISMODULE_NOT_USED(argc);
RedisModule_ReplyWithArray(ctx, numClusterEvents);
for (int i = 0; i < numClusterEvents; i++)
RedisModule_ReplyWithStringBuffer(ctx, clusterEventLog[i], strlen(clusterEventLog[i]));
return REDISMODULE_OK;
}
/* Reply with the cluster trim event log. */
int getClusterTrimEventLog(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
REDISMODULE_NOT_USED(ctx);
REDISMODULE_NOT_USED(argv);
REDISMODULE_NOT_USED(argc);
RedisModule_ReplyWithArray(ctx, numClusterTrimEvents);
for (int i = 0; i < numClusterTrimEvents; i++)
RedisModule_ReplyWithStringBuffer(ctx, clusterTrimEventLog[i], strlen(clusterTrimEventLog[i]));
return REDISMODULE_OK;
}
/* A keyless command to test module command replication. */
int moduledata = 0;
int keylessCmd(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
REDISMODULE_NOT_USED(ctx);
REDISMODULE_NOT_USED(argv);
REDISMODULE_NOT_USED(argc);
moduledata++;
RedisModule_ReplyWithLongLong(ctx, moduledata);
return REDISMODULE_OK;
}
/* Reply with the current value of moduledata (incremented by asm.keyless_cmd). */
int readkeylessCmdVal(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
REDISMODULE_NOT_USED(ctx);
REDISMODULE_NOT_USED(argv);
REDISMODULE_NOT_USED(argc);
RedisModule_ReplyWithLongLong(ctx, moduledata);
return REDISMODULE_OK;
}
/* ASM.SUBSCRIBE_TRIMMED_EVENT <0|1>: subscribe to or unsubscribe from the
 * key-trimmed keyspace notifications. */
int subscribeTrimmedEvent(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
REDISMODULE_NOT_USED(ctx);
if (argc != 2)
return RedisModule_WrongArity(ctx);
long long subscribe = 0;
if (RedisModule_StringToLongLong(argv[1], &subscribe) != REDISMODULE_OK) {
RedisModule_ReplyWithError(ctx, "ERR subscribe value");
return REDISMODULE_OK;
}
if (subscribe) {
/* Unsubscribe first to avoid duplicate subscription. */
RedisModule_UnsubscribeFromKeyspaceEvents(ctx, REDISMODULE_NOTIFY_KEY_TRIMMED, keyspaceNotificationTrimmedCallback);
int ret = RedisModule_SubscribeToKeyspaceEvents(ctx, REDISMODULE_NOTIFY_KEY_TRIMMED, keyspaceNotificationTrimmedCallback);
RedisModule_Assert(ret == REDISMODULE_OK);
} else {
int ret = RedisModule_UnsubscribeFromKeyspaceEvents(ctx, REDISMODULE_NOTIFY_KEY_TRIMMED, keyspaceNotificationTrimmedCallback);
RedisModule_Assert(ret == REDISMODULE_OK);
}
RedisModule_ReplyWithSimpleString(ctx, "OK");
return REDISMODULE_OK;
}
/* Key event callback: when a key is deleted, record its name and (string)
 * value in lastDeletedKeyLog. */
void keyEventCallback(RedisModuleCtx *ctx, RedisModuleEvent e, uint64_t sub, void *data) {
REDISMODULE_NOT_USED(ctx);
REDISMODULE_NOT_USED(e);
if (sub == REDISMODULE_SUBEVENT_KEY_DELETED) {
RedisModuleKeyInfoV1 *ei = data;
RedisModuleKey *kp = ei->key;
RedisModuleString *key = (RedisModuleString *) RedisModule_GetKeyNameFromModuleKey(kp);
size_t keylen;
const char *keyname = RedisModule_StringPtrLen(key, &keylen);
/* Read the value here; tests use this log entry to verify that a key's
 * value can still be read from within this callback when the key is
 * deleted, e.g. while its slot is being trimmed. */
size_t valuelen = 0;
const char *value = "";
RedisModuleKey *mk = RedisModule_OpenKey(ctx, key, REDISMODULE_READ);
if (RedisModule_KeyType(mk) == REDISMODULE_KEYTYPE_STRING) {
value = RedisModule_StringDMA(mk, &valuelen, 0);
}
RedisModule_CloseKey(mk);
char buf[1024] = {0};
snprintf(buf, sizeof(buf), "keyevent: key: %.*s, value: %.*s", (int) keylen, keyname, (int)valuelen, value);
if (lastDeletedKeyLog) RedisModule_Free((void *)lastDeletedKeyLog);
lastDeletedKeyLog = RedisModule_Strdup(buf);
}
}
/* Reply with the most recent deleted-key log entry, or nil if none has been
 * recorded. */
int getLastDeletedKey(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
REDISMODULE_NOT_USED(ctx);
REDISMODULE_NOT_USED(argv);
REDISMODULE_NOT_USED(argc);
if (lastDeletedKeyLog) {
RedisModule_ReplyWithStringBuffer(ctx, lastDeletedKeyLog, strlen(lastDeletedKeyLog));
} else {
RedisModule_ReplyWithNull(ctx);
}
return REDISMODULE_OK;
}
/* ASM.GET key: reply with the key's string value, or nil if the key does not
 * exist. */
int asmGetCommand(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
REDISMODULE_NOT_USED(ctx);
if (argc != 2) return RedisModule_WrongArity(ctx);
RedisModuleKey *key = RedisModule_OpenKey(ctx, argv[1], REDISMODULE_READ);
if (key == NULL) {
RedisModule_ReplyWithNull(ctx);
return REDISMODULE_OK;
}
RedisModule_Assert(RedisModule_KeyType(key) == REDISMODULE_KEYTYPE_STRING);
size_t len;
const char *value = RedisModule_StringDMA(key, &len, 0);
RedisModule_ReplyWithStringBuffer(ctx, value, len);
RedisModule_CloseKey(key);
return REDISMODULE_OK;
}
int RedisModule_OnLoad(RedisModuleCtx *ctx, RedisModuleString **argv, int argc) {
REDISMODULE_NOT_USED(argv);
REDISMODULE_NOT_USED(argc);
if (RedisModule_Init(ctx, "asm", 1, REDISMODULE_APIVER_1) == REDISMODULE_ERR)
return REDISMODULE_ERR;
if (RedisModule_CreateCommand(ctx, "asm.cluster_can_access_keys_in_slot", testClusterCanAccessKeysInSlot, "", 0, 0, 0) == REDISMODULE_ERR)
return REDISMODULE_ERR;
if (RedisModule_CreateCommand(ctx, "asm.clear_event_log", clearEventLog, "", 0, 0, 0) == REDISMODULE_ERR)
return REDISMODULE_ERR;
if (RedisModule_CreateCommand(ctx, "asm.get_cluster_event_log", getClusterEventLog, "", 0, 0, 0) == REDISMODULE_ERR)
return REDISMODULE_ERR;
if (RedisModule_CreateCommand(ctx, "asm.get_cluster_trim_event_log", getClusterTrimEventLog, "", 0, 0, 0) == REDISMODULE_ERR)
return REDISMODULE_ERR;
if (RedisModule_CreateCommand(ctx, "asm.keyless_cmd", keylessCmd, "write", 0, 0, 0) == REDISMODULE_ERR)
return REDISMODULE_ERR;
if (RedisModule_CreateCommand(ctx, "asm.read_keyless_cmd_val", readkeylessCmdVal, "", 0, 0, 0) == REDISMODULE_ERR)
return REDISMODULE_ERR;
if (RedisModule_CreateCommand(ctx, "asm.sanity", sanity, "", 0, 0, 0) == REDISMODULE_ERR)
return REDISMODULE_ERR;
if (RedisModule_CreateCommand(ctx, "asm.subscribe_trimmed_event", subscribeTrimmedEvent, "", 0, 0, 0) == REDISMODULE_ERR)
return REDISMODULE_ERR;
if (RedisModule_CreateCommand(ctx, "asm.replicate_module_command", replicate_module_command, "", 0, 0, 0) == REDISMODULE_ERR)
return REDISMODULE_ERR;
if (RedisModule_CreateCommand(ctx, "asm.lpush_replicate_crossslot_command", lpush_and_replicate_crossslot_command, "write", 0, 0, 0) == REDISMODULE_ERR)
return REDISMODULE_ERR;
if (RedisModule_CreateCommand(ctx, "asm.cluster_get_local_slot_ranges", testClusterGetLocalSlotRanges, "", 0, 0, 0) == REDISMODULE_ERR)
return REDISMODULE_ERR;
if (RedisModule_CreateCommand(ctx, "asm.get_last_deleted_key", getLastDeletedKey, "", 0, 0, 0) == REDISMODULE_ERR)
return REDISMODULE_ERR;
if (RedisModule_CreateCommand(ctx, "asm.get", asmGetCommand, "", 0, 0, 0) == REDISMODULE_ERR)
return REDISMODULE_ERR;
if (RedisModule_CreateCommand(ctx, "asm.parent", NULL, "", 0, 0, 0) == REDISMODULE_ERR)
return REDISMODULE_ERR;
RedisModuleCommand *parent = RedisModule_GetCommand(ctx, "asm.parent");
if (!parent) return REDISMODULE_ERR;
/* Subcommand: ASM.PARENT SET (write) */
if (RedisModule_CreateSubcommand(parent, "set", asmParentSet, "write fast", 2, 2, 1) == REDISMODULE_ERR)
return REDISMODULE_ERR;
if (RedisModule_SubscribeToServerEvent(ctx, RedisModuleEvent_ClusterSlotMigration, clusterEventCallback) == REDISMODULE_ERR)
return REDISMODULE_ERR;
if (RedisModule_SubscribeToServerEvent(ctx, RedisModuleEvent_ClusterSlotMigrationTrim, clusterTrimEventCallback) == REDISMODULE_ERR)
return REDISMODULE_ERR;
if (RedisModule_SubscribeToKeyspaceEvents(ctx, REDISMODULE_NOTIFY_KEY_TRIMMED, keyspaceNotificationTrimmedCallback) == REDISMODULE_ERR)
return REDISMODULE_ERR;
if (RedisModule_SubscribeToServerEvent(ctx, RedisModuleEvent_Key, keyEventCallback) == REDISMODULE_ERR)
return REDISMODULE_ERR;
return REDISMODULE_OK;
}
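
For reference only (not part of the diff), here is a rough sketch of how a cluster test could drive this module and inspect its event log. The module path, the overrides-based `loadmodule`, the test name, and the exact set of events each node observes are assumptions on my part; the helpers (`start_cluster`, `R`, `CI`, `wait_for_condition`, `assert`) are used the same way as in the tests elsewhere in this PR.

```tcl
# Hypothetical sketch: assumes the module above is built as tests/modules/asm.so
# and that loadmodule can be passed through start_cluster's overrides.
set testmodule [file normalize tests/modules/asm.so]

start_cluster 2 0 [list tags {external:skip cluster} overrides [list loadmodule $testmodule]] {
    test "Module observes atomic slot migration events" {
        R 0 asm.clear_event_log
        R 1 asm.clear_event_log

        # Move slot 0 from node 0 (its owner) to node 1 and wait for completion.
        R 1 CLUSTER MIGRATION IMPORT 0 0
        wait_for_condition 1000 10 {
            [CI 0 cluster_slot_migration_active_tasks] == 0 &&
            [CI 1 cluster_slot_migration_active_tasks] == 0
        } else {
            fail "ASM tasks did not complete"
        }

        # Both sides should have recorded events: migrate-* on the source and
        # import-* on the destination (an assumption based on the callback above).
        assert {[llength [R 0 asm.get_cluster_event_log]] > 0}
        assert {[llength [R 1 asm.get_cluster_event_log]] > 0}
    }
}
```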


@@ -13,6 +13,21 @@
#
# Cluster helper functions
# Normalize cluster slots configuration by sorting replicas by node ID
# (a worked example follows this file's hunks below).
proc normalize_cluster_slots {slots_config} {
set normalized {}
foreach slot_range $slots_config {
if {[llength $slot_range] <= 3} {
lappend normalized $slot_range
} else {
# Sort replicas (index 3+) by node ID, keep start/end/master unchanged
set replicas [lrange $slot_range 3 end]
set sorted_replicas [lsort -index 2 $replicas]
lappend normalized [concat [lrange $slot_range 0 2] $sorted_replicas]
}
}
return $normalized
}
# Check if cluster configuration is consistent.
proc cluster_config_consistent {} {
@@ -20,8 +35,12 @@ proc cluster_config_consistent {} {
if {$j == 0} {
set base_cfg [R $j cluster slots]
set base_secret [R $j debug internal_secret]
+set normalized_base_cfg [normalize_cluster_slots $base_cfg]
} else {
-if {[R $j cluster slots] != $base_cfg || [R $j debug internal_secret] != $base_secret} {
+set cfg [R $j cluster slots]
+set secret [R $j debug internal_secret]
+set normalized_cfg [normalize_cluster_slots $cfg]
+if {$normalized_cfg != $normalized_base_cfg || $secret != $base_secret} {
return 0
}
}
@@ -119,6 +138,8 @@ proc cluster_setup {masters node_count slot_allocator code} {
# Start a cluster with the given number of masters and replicas. Replicas
# will be allocated to masters by round robin.
proc start_cluster {masters replicas options code {slot_allocator continuous_slot_allocation}} {
set ::cluster_master_nodes $masters
set ::cluster_replica_nodes $replicas
set node_count [expr $masters + $replicas]
# Set the final code to be the tests + cluster setup


@@ -245,6 +245,10 @@ proc s {args} {
status [srv $level "client"] [lindex $args 0]
}
# Get the specified field from the given instance's info output.
proc S {index field} {
getInfoProperty [R $index info] $field
}
# Get the specified field from the given instance's cluster info output.
proc CI {index field} {
getInfoProperty [R $index cluster info] $field

File diff suppressed because it is too large.


@@ -986,3 +986,49 @@ start_cluster 1 1 {tags {external:skip cluster} overrides {cluster-slot-stats-enabled yes}} {
R 0 CONFIG RESETSTAT
R 1 CONFIG RESETSTAT
}
start_cluster 2 2 {tags {external:skip cluster} overrides {cluster-slot-stats-enabled yes}} {
test "CLUSTER SLOT-STATS reset upon atomic slot migration" {
# key on slot-0
set key0 "{06S}mykey0"
set key0_slot [R 0 CLUSTER KEYSLOT $key0]
R 0 SET $key0 VALUE
# Migrate slot-0 to node-1
R 1 CLUSTER MIGRATION IMPORT 0 0
wait_for_condition 1000 10 {
[CI 0 cluster_slot_migration_active_tasks] == 0 &&
[CI 1 cluster_slot_migration_active_tasks] == 0
} else {
fail "ASM tasks did not complete"
}
set expected_slot_stats [
dict create $key0_slot [
dict create key-count 1 \
cpu-usec 0 \
network-bytes-in 0 \
network-bytes-out 0
]
]
set metrics_to_assert [list key-count cpu-usec network-bytes-in network-bytes-out]
# Verify metrics are reset except key-count
set slot_stats [R 1 CLUSTER SLOT-STATS SLOTSRANGE 0 0]
assert_empty_slot_stats_with_exception $slot_stats $expected_slot_stats $metrics_to_assert
# Migrate slot-0 back to node-0
R 0 CLUSTER MIGRATION IMPORT 0 0
wait_for_condition 1000 10 {
[CI 0 cluster_slot_migration_active_tasks] == 0 &&
[CI 1 cluster_slot_migration_active_tasks] == 0
} else {
fail "ASM tasks did not complete"
}
# Verify metrics are reset except key-count
set slot_stats [R 0 CLUSTER SLOT-STATS SLOTSRANGE 0 0]
assert_empty_slot_stats_with_exception $slot_stats $expected_slot_stats $metrics_to_assert
}
}