Redis cluster specification notes

Last Update:2014-07-09 Source: Internet

Author: User

Tags redis cluster


Ref: http://redis.io/topics/cluster-spec

1. Design goals: high performance; linear scalability; no merge operations; write safety: acknowledged writes are lost only with low probability; availability: every unreachable master needs at least one reachable slave;

Redis cluster is a distributed implementation of redis with the following goals, in order of importance in the design:

High Performance and linear scalability up to 1000 nodes.
No merge operations in order to play well with values size and semantics typical of the redis data model.
Write safety: the system tries to retain all the writes originating from clients connected with the majority of the nodes. However there are small windows where acknowledged writes can be lost.
Availability: Redis Cluster is able to survive partitions where the majority of the master nodes are reachable and there is at least one reachable slave for every master node that is no longer reachable.

What is described in this document is implemented in the unstable branch of the GitHub Redis repository. Redis Cluster has now entered the beta stage, so new betas are released every month and can be found in the download page of the Redis web site.

2. Hash tags can be used to force related keys into the same hash slot, and therefore onto the same node;

3. Communication protocol:

Nodes talk to each other over a binary protocol called the "cluster bus";
Every node keeps a TCP connection to every other node (so the number of connections does not scale linearly);
A client can send a request to any node. Nodes do not proxy requests; instead, a node that does not serve the key replies with a redirection error, much like an HTTP redirect;

4. Write safety: data can be lost in two cases:

Redis Cluster tries hard to retain all the writes that are performed by clients connected to the majority of masters, with two exceptions:

1) A write may reach a master, but while the master may be able to reply to the client, the write may not be propagated to slaves via the asynchronous replication used between master and slave nodes. If the master dies without the write reaching the slaves, the write is lost forever in case the master is unreachable for a long enough period that one of its slaves is promoted.

2) Another theoretically possible failure mode where writes are lost is the following:

A master is unreachable because of a partition.
It gets failed over by one of its slaves.
After some time it may be reachable again.
A client with a not updated routing table may write to it before the master is converted to a slave (of the new master) by the cluster.

5. Availability: when the network is partitioned, the side that holds the majority of masters remains usable and the minority side does not. The design is not meant to cope with large-scale network failures. A key stays accessible as long as its master, or a slave that can replace it, is reachable on the majority side;

6. Performance: each node performs about the same as a single Redis instance, so throughput grows roughly linearly with the number of nodes;

7. Why merge operations are not supported: performance considerations;

8. Key distribution: the key is hashed with CRC16, the result is taken modulo 16384 (16K hash slots), and the slots are distributed among the nodes;

HASH_SLOT = CRC16(key) mod 16384

The crc16 is specified as follows:

Name: XMODEM (also known as ZMODEM or CRC-16/ACORN)
Width: 16 bit
Poly: 1021 (that is actually x^16 + x^12 + x^5 + 1)
Initialization: 0000
Reflect input byte: false
Reflect output CRC: false
XOR constant to output CRC: 0000
Output for "123456789": 31c3
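The parameters above are enough to write the slot function down. The following is a minimal Python sketch, not the Redis source code; the function names `crc16_xmodem` and `hash_slot` are made up for this example, and hash tags are not handled here.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with poly 0x1021, init 0x0000, no reflection, no final XOR."""
    crc = 0x0000
    for byte in data:
        crc ^= byte << 8  # feed the next byte into the high 8 bits
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def hash_slot(key: bytes) -> int:
    """HASH_SLOT = CRC16(key) mod 16384 (hash tags are ignored in this sketch)."""
    return crc16_xmodem(key) % 16384


# The check value listed in the parameters above.
print(hex(crc16_xmodem(b"123456789")))  # 0x31c3
```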

9. Key hash tags: in a key such as "{tag}otherstring", only the tag between the braces is hashed to compute the slot, so that keys sharing a tag map to the same slot.

10. Node attributes: the node ID is a random number. It is written to the configuration file the first time the node runs and never changes afterwards;

Every node has other associated information that all the other nodes know:

The IP address and TCP port where the node is located.
A set of flags.
A set of hash slots served by the node.
The last time we sent a ping packet using the cluster bus.
The last time we received a pong packet in reply.
The time at which we flagged the node as failing.
The number of slaves of this node.
The master node ID, if this node is a slave (or 0000000... if it is a master).

11. Cluster topology: a full mesh of long-lived TCP connections.

Redis cluster is a full mesh where every node is connected with every other node using a TCP connection.

In a cluster of N nodes, every node has N-1 outgoing TCP connections and N-1 incoming connections.
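To see why the full mesh does not scale linearly (the remark in note 3), the connection count can be worked out directly. This is a small arithmetic sketch; `cluster_bus_links` is a hypothetical helper name.

```python
def cluster_bus_links(n):
    """Return (outgoing per node, incoming per node, total connections)
    for a full mesh of n nodes."""
    per_node = n - 1
    # Every outgoing connection of one node is an incoming connection of
    # another, so the cluster-wide total is n * (n - 1).
    return per_node, per_node, n * per_node


print(cluster_bus_links(6))     # (5, 5, 30)
print(cluster_bus_links(1000))  # (999, 999, 999000)
```

The per-node count grows linearly with N, but the cluster-wide total grows quadratically.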

These TCP connections are kept alive all the time and are not created on demand.

12. Inter-node communication: when a new node is added, only the administrator can send the initial MEET message. The new node is then made known to the rest of the cluster.

13. Redirection policy

A Redis client is free to send queries to every node in the cluster, including slave nodes. The node will analyze the query, and if it is acceptable (that is, only a single key is mentioned in the query) it will see what node is responsible for the hash slot where the key belongs.

If the hash slot is served by the node, the query is simply processed, otherwise the node will check its internal hash slot -> node ID map and will reply to the client with a MOVED error.

A MOVED error is like the following:

GET x
-MOVED 3999 127.0.0.1:6381
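A client has to pull the slot number and the target address out of that reply. The following is a minimal Python sketch of such a parser; `parse_moved` is a hypothetical helper, not part of any Redis client library.

```python
def parse_moved(reply: str):
    """Parse '-MOVED <slot> <host>:<port>' into (slot, host, port).

    Returns None when the reply is not a MOVED error.
    """
    parts = reply.lstrip("-").split()
    if len(parts) != 3 or parts[0] != "MOVED":
        return None
    # Split on the last colon so the host part is kept intact.
    host, _, port = parts[2].rpartition(":")
    return int(parts[1]), host, int(port)


print(parse_moved("-MOVED 3999 127.0.0.1:6381"))  # (3999, '127.0.0.1', 6381)
```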

14. Key Migration:

The following subcommands are available:

CLUSTER ADDSLOTS slot1 [slot2] ... [slotN]
CLUSTER DELSLOTS slot1 [slot2] ... [slotN]
CLUSTER SETSLOT slot NODE node
CLUSTER SETSLOT slot MIGRATING node
CLUSTER SETSLOT slot IMPORTING node
```
CLUSTER GETKEYSINSLOT slot count
```

MIGRATE target_host target_port key target_database id timeout

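15. ASK redirection: a one-time redirect for a key whose slot is being migrated. Unlike MOVED, the client retries only this query on the target node (sending ASKING first) and does not update its slot map;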

16. Client redirection: the client should cache the slot-to-node mapping (to reduce the number of redirected requests) and handle redirection errors correctly;
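The caching behaviour in note 16 can be sketched with an in-memory stand-in for the cluster. Everything here is hypothetical scaffolding (`FakeNode`, `Client`, the two addresses and the even slot split); it also assumes that `binascii.crc_hqx` with an initial value of 0 matches the XMODEM CRC16 described in note 8. ASK redirection is left out.

```python
import binascii

SLOTS = 16384


def hash_slot(key: bytes) -> int:
    # Assumption: crc_hqx(key, 0) equals the XMODEM CRC16 of note 8.
    return binascii.crc_hqx(key, 0) % SLOTS


class FakeNode:
    """Stand-in for a cluster node: serves its own slots, redirects the rest."""

    def __init__(self, addr, owned_slots, owner_of):
        self.addr = addr
        self.owned = set(owned_slots)
        self.owner_of = owner_of  # shared map: slot -> address of the owner
        self.data = {}            # key -> value (bytes)

    def get(self, key):
        slot = hash_slot(key)
        if slot not in self.owned:
            return "-MOVED %d %s" % (slot, self.owner_of[slot])
        return self.data.get(key)


class Client:
    """Remembers slot -> node so that each slot is redirected at most once."""

    def __init__(self, nodes, startup_addr):
        self.nodes = nodes        # address -> FakeNode
        self.default = startup_addr
        self.slot_cache = {}      # slot -> address learned from MOVED
        self.redirects = 0

    def get(self, key, max_redirects=5):
        addr = self.slot_cache.get(hash_slot(key), self.default)
        for _ in range(max_redirects):
            reply = self.nodes[addr].get(key)
            if isinstance(reply, str) and reply.startswith("-MOVED"):
                _, moved_slot, addr = reply.split()
                self.slot_cache[int(moved_slot)] = addr  # learn the new owner
                self.redirects += 1
                continue
            return reply
        raise RuntimeError("too many redirections")


# Two fake nodes, each owning half of the slots.
A, B = "127.0.0.1:7000", "127.0.0.1:7001"
owner_of = {s: (A if s < 8192 else B) for s in range(SLOTS)}
nodes = {
    A: FakeNode(A, range(0, 8192), owner_of),
    B: FakeNode(B, range(8192, SLOTS), owner_of),
}
```

A client that starts at the wrong node pays one redirect for a slot; later queries for the same slot go straight to the owner.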

17. Multiple-key operations

Using hash tags clients are free to use multiple-keys operations. For example the following operation is valid:

MSET {user:1000}.name Angela {user:1000}.surname White
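The MSET above works because both keys hash only the text between the braces. The following is a minimal Python sketch of that rule; `hash_tag_part` and `key_slot` are hypothetical names. The handling of a missing closing brace and of an empty "{}" tag (the whole key is hashed in both cases) is an assumption taken from the Redis Cluster specification, not from these notes.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with poly 0x1021, init 0x0000, no reflection, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc


def hash_tag_part(key: bytes) -> bytes:
    """Return the part of the key that is actually hashed."""
    start = key.find(b"{")
    if start == -1:
        return key                      # no opening brace: hash the whole key
    end = key.find(b"}", start + 1)
    if end == -1 or end == start + 1:
        return key                      # no closing brace, or empty tag "{}"
    return key[start + 1:end]           # text between the first { and the next }


def key_slot(key: bytes) -> int:
    return crc16_xmodem(hash_tag_part(key)) % 16384


# Both keys of the MSET example land in the same slot.
print(key_slot(b"{user:1000}.name") == key_slot(b"{user:1000}.surname"))  # True
```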

18. Fault tolerance:

Inter-node heartbeat detection: every node pings a few random nodes each second, so the number of heartbeat packets a node sends stays roughly constant regardless of cluster size;
Heartbeat packet content:

The common header has the following information:

Node ID, that is a 160 bit pseudorandom string that is assigned the first time a node is created and remains the same for all the life of a Redis Cluster node.
The currentEpoch and configEpoch fields, that are used in order to mount the distributed algorithms used by Redis Cluster (this is explained in detail in the next sections). If the node is a slave the configEpoch is the last known configEpoch of the master.
The node flags, indicating if the node is a slave, a master, and other single-bit node information.
A bitmap of the hash slots served by a given node, or if the node is a slave, a bitmap of the slots served by its master.
Port: the sender TCP base port (that is, the port used by Redis to accept client commands; add 10000 to this to obtain the cluster port).
State: the state of the cluster from the point of view of the sender (down or OK).
The master node ID, if this is a slave.

19. Failed node detection: the PFAIL and FAIL flags. When node A's heartbeat to node B times out, A flags B as PFAIL. A then collects, via gossip, the other nodes' views of B. If the majority of masters also report B as failing, A flags B as FAIL and notifies all other nodes that B is in FAIL state.

This mechanism is used in order to escalate a PFAIL condition to a FAIL condition, when the following set of conditions are met:

Some node, that we'll call A, has another node B flagged as PFAIL.
Node A collected, via gossip sections, information about the state of B from the point of view of the majority of masters in the cluster.
The majority of masters signaled the PFAIL or FAIL condition within NODE_TIMEOUT * FAIL_REPORT_VALIDITY_MULT time.

If all the above conditions are true, node A will:

Mark the node as FAIL.
Send a FAIL message to all the reachable nodes.

The FAIL message will force every receiving node to mark the node in FAIL state.
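The escalation conditions can be sketched as a small bookkeeping class. This is an illustration under stated assumptions, not the Redis implementation: the class `FailureReports` is hypothetical, the two constant values are picked only for the example, and the sketch counts gossip reports only (whether node A also counts its own opinion is left out).

```python
NODE_TIMEOUT = 15.0            # seconds; illustrative value, not from the notes
FAIL_REPORT_VALIDITY_MULT = 2  # illustrative value, not from the notes


class FailureReports:
    """Failure reports that node A has collected about one other node B."""

    def __init__(self, num_masters):
        self.num_masters = num_masters
        self.reports = {}  # reporting master id -> time of its latest report

    def add_report(self, master_id, now):
        self.reports[master_id] = now

    def should_mark_fail(self, locally_pfail, now):
        # Condition 1: A itself must already flag B as PFAIL.
        if not locally_pfail:
            return False
        # Condition 3: only reports inside the validity window count.
        window = NODE_TIMEOUT * FAIL_REPORT_VALIDITY_MULT
        fresh = sum(1 for t in self.reports.values() if now - t <= window)
        # Condition 2: a majority of the masters must agree.
        return fresh >= self.num_masters // 2 + 1


reports = FailureReports(num_masters=5)
reports.add_report("m1", now=0.0)
reports.add_report("m2", now=1.0)
print(reports.should_mark_fail(True, now=2.0))  # False: 2 of 5 is not a majority
reports.add_report("m3", now=2.0)
print(reports.should_mark_fail(True, now=3.0))  # True: 3 of 5 within the window
```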

20. logical clock: Cluster epoch

21. Slave promotion: a slave detects that its master has failed, the slave starts an election, and the slave that wins the election promotes itself to master. Election process:

Slave A sends FAILOVER_AUTH_REQUEST to all the masters and waits for replies for up to NODE_TIMEOUT * 2;
A master that receives FAILOVER_AUTH_REQUEST replies with FAILOVER_AUTH_ACK if it agrees; having voted, it will not vote for another slave of the same master for NODE_TIMEOUT * 2 (similar to ZooKeeper);
If slave A receives ACKs from the majority (more than half) of the masters, it wins the election, broadcasts its victory, and becomes the new master.
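The voting steps above can be sketched in a few lines of Python. This is a simplified model and not the Redis algorithm: the real election also relies on epochs (note 20), which are left out here. The class `Master`, the function `run_election` and the `NODE_TIMEOUT` value are hypothetical.

```python
NODE_TIMEOUT = 15.0  # seconds; illustrative value, not from the notes


class Master:
    """A voting master that remembers when it last voted for each failed master."""

    def __init__(self, name):
        self.name = name
        self.last_vote = {}  # failed master id -> time of the last vote

    def handle_auth_request(self, failed_master, now):
        last = self.last_vote.get(failed_master)
        # Refuse a second vote for the same master within NODE_TIMEOUT * 2.
        if last is not None and now - last < NODE_TIMEOUT * 2:
            return False
        self.last_vote[failed_master] = now
        return True  # stands for sending FAILOVER_AUTH_ACK


def run_election(reachable_masters, total_masters, failed_master, now):
    """Return True when the requesting slave collects a majority of ACKs."""
    acks = sum(1 for m in reachable_masters
               if m.handle_auth_request(failed_master, now))
    return acks >= total_masters // 2 + 1


# Five masters in total; "m0" has failed, so four can still vote.
voters = [Master("m%d" % i) for i in range(1, 5)]
print(run_election(voters, 5, "m0", now=0.0))   # True: slave A gets 4 of 5
print(run_election(voters, 5, "m0", now=10.0))  # False: votes were already given
```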

22. Key Slot Allocation and information dissemination.

Rule 1: If a hash slot is unassigned, and a known node claims it, I'll modify my hash slot table to associate the hash slot to this node.

Rule 2: If a hash slot is already assigned, and a known node is advertising it using a configEpoch that is greater than the configEpoch advertised by the current owner of the slot, I'll rebind the hash slot to the new node.
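The two rules translate into a short update function. This is a minimal sketch, assuming the local slot table is a dict from slot number to an (owner, configEpoch) pair; `apply_claim` and the node names are hypothetical.

```python
def apply_claim(slot_table, slot, claimer, claimer_epoch):
    """Apply one advertised slot claim. Returns True if the slot was (re)bound."""
    current = slot_table.get(slot)
    if current is None:
        # Rule 1: the slot is unassigned, so accept the claim.
        slot_table[slot] = (claimer, claimer_epoch)
        return True
    owner, owner_epoch = current
    if claimer == owner:
        # Same owner: just remember the newest configEpoch we have seen.
        slot_table[slot] = (owner, max(owner_epoch, claimer_epoch))
        return False
    if claimer_epoch > owner_epoch:
        # Rule 2: a greater configEpoch wins the slot.
        slot_table[slot] = (claimer, claimer_epoch)
        return True
    return False


table = {}
print(apply_claim(table, 100, "node-a", 1))  # True  (rule 1)
print(apply_claim(table, 100, "node-b", 1))  # False (epoch is not greater)
print(apply_claim(table, 100, "node-b", 2))  # True  (rule 2)
```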

23. Publish and subscribe: PUBLISH and SUBSCRIBE can be sent to any node; the cluster forwards published messages to the nodes that need them;
