Notes on the Redis Cluster Specification

Last updated: 2014-07-09. Source: the Internet.

Uploaded by: User


Ref: http://redis.io/topics/cluster-spec

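1. Design goals: high performance; linear scalability; no merge operations; write safety (acknowledged writes are lost only in small windows); availability (each key stays available as long as at least one slave of an unreachable master is still working).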

Redis Cluster is a distributed implementation of Redis with the following goals, in order of importance in the design:

High performance and linear scalability up to 1000 nodes.
No merge operations in order to play well with values size and semantics typical of the Redis data model.
Write safety: the system tries to retain all the writes originating from clients connected with the majority of the nodes. However there are small windows where acknowledged writes can be lost.
Availability: Redis Cluster is able to survive to partitions where the majority of the master nodes are reachable and there is at least a reachable slave for every master node that is no longer reachable.

What is described in this document is implemented in the unstable branch of the Github Redis repository. Redis Cluster has now entered the beta stage, so new betas are released every month and can be found in the download page of the Redis web site.

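2. Hash tags are used so that a key is always routed to a fixed node.

3. Communication protocol:

Nodes talk to each other over a binary protocol called the "cluster bus".
Every node keeps a TCP connection to every other node (in this respect the design does not scale linearly).
A client may connect to any node and send it any request. The node does not act as a proxy, though: like HTTP, it replies with a redirection error.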

4. Write safety: there are two ways in which writes can be lost:

Redis Cluster tries hard to retain all the writes that are performed by clients connected to the majority of masters, with two exceptions:

1) A write may reach a master, but while the master may be able to reply to the client, the write may not be propagated to slaves via the asynchronous replication used between master and slave nodes. If the master dies without the write reaching the slaves, the write is lost forever in case the master is unreachable for a long enough period that one of its slaves is promoted.

2) Another theoretically possible failure mode where writes are lost is the following:

A master is unreachable because of a partition.
It gets failed over by one of its slaves.
After some time it may be reachable again.
A client with a not updated routing table may write to it before the master is converted to a slave (of the new master) by the cluster.

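5. Availability: when the network partitions, the side that contains the majority of the masters keeps working and the other side becomes unavailable. The design is not suited to large-scale network failures. Any key remains accessible as long as its master or one of that master's slaves is reachable.

6. Performance: each node performs about the same as a single standalone Redis instance (this is what "linear scalability" means here).

7. Why merge operations are not supported: performance considerations.

8. Key distribution: the key is hashed with CRC16 and reduced modulo 16384 to pick one of 16k hash slots; the slots are then assigned to the individual nodes.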

HASH_SLOT = CRC16(key) mod 16384

The CRC16 is specified as follows:

Name: XMODEM (also known as ZMODEM or CRC-16/ACORN)
Width: 16 bit
Poly: 1021 (That is actually x^16 + x^12 + x^5 + 1)
Initialization: 0000
Reflect Input byte: False
Reflect Output CRC: False
Xor constant to output CRC: 0000
Output for "123456789": 31C3
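The parameters above are enough to compute a slot by hand. The following is a minimal Python sketch, not the Redis implementation: the function names `crc16_xmodem` and `hash_slot` are illustrative, and hash tags are deliberately ignored here. It runs the CRC bit by bit with the listed polynomial and initial value.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC-16/XMODEM: poly 0x1021, init 0x0000, no reflection, no final XOR."""
    crc = 0x0000
    for byte in data:
        crc ^= byte << 8  # feed the next byte into the top of the 16-bit register
        for _ in range(8):
            if crc & 0x8000:
                # Top bit set: shift it out and apply the polynomial.
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def hash_slot(key: bytes) -> int:
    """HASH_SLOT = CRC16(key) mod 16384 (hash tags are not handled in this sketch)."""
    return crc16_xmodem(key) % 16384
```

With these definitions, `crc16_xmodem(b"123456789")` should equal `0x31C3`, which matches the test vector listed above.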

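9. Key hash tags: in the key "{tag}otherString", tag is the hash tag. Only the tag is used to compute the key's slot, so keys that share a tag map to the same slot.

10. Node attributes: a node is identified by a random number that is written to the configuration file on first run and never changes afterwards.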

Every node has other associated information that all the other nodes know:

The IP address and TCP port where the node is located.
A set of flags.
A set of hash slots served by the node.
Last time we sent a ping packet using the cluster bus.
Last time we received a pong packet in reply.
The time at which we flagged the node as failing.
The number of slaves of this node.
The master node ID, if this node is a slave (or 0000000... if it is a master).

11. Cluster topology: a full mesh of long-lived TCP connections.

Redis cluster is a full mesh where every node is connected with every other node using a TCP connection.

In a cluster of N nodes, every node has N-1 outgoing TCP connections, and N-1 incoming connections.

These TCP connections are kept alive all the time and are not created on demand.
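A quick back-of-the-envelope calculation shows how the full mesh grows with cluster size. This is only an illustrative sketch (the helper name `mesh_connections` is made up for these notes): each node holds N-1 outgoing and N-1 incoming connections, and every outgoing connection of one node is an incoming connection of another, so the cluster as a whole holds N * (N-1) connections.

```python
from typing import Tuple


def mesh_connections(n: int) -> Tuple[int, int]:
    """Return (connections held by one node, connections in the whole cluster)."""
    per_node = 2 * (n - 1)   # N-1 outgoing plus N-1 incoming
    total = n * (n - 1)      # each connection counted once, at its originating node
    return per_node, total


for n in (3, 10, 100, 1000):
    per_node, total = mesh_connections(n)
    print(f"N={n}: {per_node} per node, {total} in total")
```

The per-node count grows linearly with N while the cluster-wide count grows quadratically, which is the non-linear aspect mentioned in note 3.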

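12. Node handshake: when a new node joins, only the system administrator can trigger the MEET message; information about the new node then propagates through the cluster.

13. Redirection policy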

A Redis client is free to send queries to every node in the cluster, including slave nodes. The node will analyze the query, and if it is acceptable (that is, only a single key is mentioned in the query) it will see what node is responsible for the hash slot where the key belongs.

If the hash slot is served by the node, the query is simply processed, otherwise the node will check its internal hash slot -> node ID map and will reply to the client with a MOVED error.

A MOVED error is like the following:

GET x
-MOVED 3999 127.0.0.1:6381
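The node-side decision described above can be sketched in a few lines. This is a simplified illustration, not Redis code: `route_query`, the node IDs, and the two dictionaries are hypothetical stand-ins for the node's internal hash slot -> node ID map and its table of known node addresses.

```python
from typing import Dict, Tuple


def route_query(
    my_id: str,
    slot: int,
    slot_owner: Dict[int, str],
    addresses: Dict[str, Tuple[str, int]],
) -> str:
    """Return "PROCESS" if this node serves the slot, else a MOVED error line."""
    owner = slot_owner[slot]
    if owner == my_id:
        return "PROCESS"  # the slot is ours: handle the query locally
    host, port = addresses[owner]
    return f"-MOVED {slot} {host}:{port}"  # point the client at the owner


slot_owner = {3999: "node-b"}                    # hypothetical slot table
addresses = {"node-b": ("127.0.0.1", 6381)}      # hypothetical address book
print(route_query("node-a", 3999, slot_owner, addresses))
```

Note that the node never forwards the query itself; it only tells the client where to go, which is the "no proxy" behaviour from note 3.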

14. Key migration:

The following subcommands are available:

CLUSTER ADDSLOTS slot1 [slot2] ... [slotN]
CLUSTER DELSLOTS slot1 [slot2] ... [slotN]
CLUSTER SETSLOT slot NODE node
CLUSTER SETSLOT slot MIGRATING node
CLUSTER SETSLOT slot IMPORTING node
CLUSTER GETKEYSINSLOT slot count

MIGRATE target_host target_port key target_database id timeout

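15. ASK redirection: tells the client which node to query for a key (used while a slot is being migrated).

16. Client handling of redirections: a client should remember which node serves which slot (to reduce the number of redirects) and must handle redirection errors.

17. Multiple-key operations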
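On the client side, the advice above amounts to keeping a small slot table and updating it whenever a MOVED error arrives. The class below is a minimal sketch under assumptions: `SlotCache` and its methods are invented for these notes, replies are handled as plain strings, and ASK redirections are not covered.

```python
from typing import Dict, Optional, Tuple

Address = Tuple[str, int]


class SlotCache:
    """Remembers which node serves each hash slot, learning from MOVED errors."""

    def __init__(self, default_node: Address) -> None:
        self.default_node = default_node       # used when a slot is still unknown
        self.slots: Dict[int, Address] = {}

    def node_for_slot(self, slot: int) -> Address:
        return self.slots.get(slot, self.default_node)

    def handle_reply(self, reply: str) -> Optional[Address]:
        """If reply is a MOVED error, record the new owner and return it; else None."""
        parts = reply.split()
        if len(parts) == 3 and parts[0] == "-MOVED":
            slot = int(parts[1])
            host, _, port = parts[2].rpartition(":")
            self.slots[slot] = (host, int(port))
            return self.slots[slot]
        return None


cache = SlotCache(("127.0.0.1", 6379))
target = cache.handle_reply("-MOVED 3999 127.0.0.1:6381")
if target is not None:
    print("retry the query on", target)
```

After the first redirect for a slot, later queries for that slot go straight to the right node, so each slot costs at most one extra round trip until the cluster layout changes again.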

Using hash tags clients are free to use multiple-keys operations. For example the following operation is valid:

MSET {user:1000}.name Angela {user:1000}.surname White
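The MSET above works because both keys carry the same hash tag and therefore land in the same slot. The sketch below illustrates this, assuming the tag rule from the referenced specification: if the key contains a "{" followed later by a "}" with at least one character in between, only that substring is hashed. The function names are illustrative and this is not Redis source code.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC-16/XMODEM: poly 0x1021, init 0x0000, no reflection, no final XOR."""
    crc = 0x0000
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_hash_slot(key: bytes) -> int:
    """Hash only the tag between the first '{' and the next '}', if it is non-empty."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:   # "{}" is not a tag: hash the whole key
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384


a = key_hash_slot(b"{user:1000}.name")
b = key_hash_slot(b"{user:1000}.surname")
print(a == b)  # both keys hash only "user:1000"
```

Keys without a usable tag fall back to hashing the whole key, so existing key names keep working unchanged.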

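18. Fault tolerance:

Heartbeats between nodes: each node pings a small random set of nodes, so that the total number of heartbeats in the whole cluster stays on the order of N.
Heartbeat packet contents: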

The common header has the following information:

Node ID, that is a 160 bit pseudorandom string that is assigned the first time a node is created and remains the same for all the life of a Redis Cluster node.
The currentEpoch and configEpoch field, that are used in order to mount the distributed algorithms used by Redis Cluster (this is explained in details in the next sections). If the node is a slave the configEpoch is the last known configEpoch of the master.
The node flags, indicating if the node is a slave, a master, and other single-bit node information.
A bitmap of the hash slots served by a given node, or if the node is a slave, a bitmap of the slots served by its master.
Port: the sender TCP base port (that is, the port used by Redis to accept client commands, add 10000 to this to obtain the cluster port).
State: the state of the cluster from the point of view of the sender (down or ok).
The master node ID, if this is a slave.

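19. Failure detection: the PFAIL and FAIL flags. When A's heartbeat to B fails, A flags B as PFAIL. A then collects the other nodes' view of B; if the majority of masters report B as PFAIL, A flags B as FAIL and notifies all other nodes that B is in FAIL state.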

This mechanism is used in order to escalate a PFAIL condition to a FAIL condition, when the following set of conditions are met:

Some node, that we'll call A, has another node B flagged as PFAIL.
Node A collected, via gossip sections, information about the state of B from the point of view of the majority of masters in the cluster.
The majority of masters signaled the PFAIL or FAIL condition within NODE_TIMEOUT * FAIL_REPORT_VALIDITY_MULT time.

If all the above conditions are true, Node A will:

Mark the node as FAIL.
Send a FAIL message to all the reachable nodes.

The FAIL message will force every receiving node to mark the node in FAIL state.
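The escalation conditions can be expressed as a single predicate. The sketch below is an illustration only: `should_mark_fail` and its arguments are hypothetical, timestamps are plain numbers in the same unit as the timeout, and it assumes that `reports` already contains A's own view when A is itself a master.

```python
from typing import Dict, Set


def should_mark_fail(
    a_flags_b_pfail: bool,
    reports: Dict[str, float],   # master ID -> time of its latest failure report about B
    masters: Set[str],           # IDs of all masters in the cluster
    now: float,
    node_timeout: float,
    validity_mult: float,        # stands in for FAIL_REPORT_VALIDITY_MULT
) -> bool:
    """True when A may escalate B from PFAIL to FAIL."""
    if not a_flags_b_pfail:
        return False
    window = node_timeout * validity_mult
    # Only reports from masters that are still inside the validity window count.
    recent = {m for m, t in reports.items() if m in masters and now - t <= window}
    return len(recent) > len(masters) // 2


masters = {"m1", "m2", "m3", "m4", "m5"}
reports = {"m1": 95.0, "m2": 98.0, "m3": 99.0}
print(should_mark_fail(True, reports, masters, now=100.0, node_timeout=15.0, validity_mult=2.0))
```

Expiring old reports is what keeps a node from being marked FAIL on the basis of complaints that were never simultaneous.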

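20. Logical clock: the cluster epoch

21. Promoting a slave to master: the master's failure is detected -> a slave starts an election -> the slave that wins the election turns itself into a master. The election proceeds as follows:

Slave A sends FAILOVER_AUTH_REQUEST to all masters and waits for replies (for a period of NODE_TIMEOUT * 2).
A master that receives FAILOVER_AUTH_REQUEST and decides to grant it replies with FAILOVER_AUTH_ACK, and then does not grant any other request for NODE_TIMEOUT * 2 (similar to ZooKeeper).
If slave A receives ACKs from the majority (more than half) of the masters, it wins the election, broadcasts that it has won, and can then promote itself to master.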
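The voting steps above can be simulated with a toy model. This is a simplified sketch under assumptions: `VotingMaster` and `wins_election` are invented names, time is a plain number, and the epoch checks that a real master performs before granting a vote are left out.

```python
class VotingMaster:
    """A master that grants at most one vote per 2 * NODE_TIMEOUT window."""

    def __init__(self, node_timeout: float) -> None:
        self.node_timeout = node_timeout
        self.last_vote_time = None  # time of the last granted vote, if any

    def handle_auth_request(self, now: float) -> bool:
        """Return True (FAILOVER_AUTH_ACK) if the vote is granted, else False."""
        if (
            self.last_vote_time is not None
            and now - self.last_vote_time < 2 * self.node_timeout
        ):
            return False  # voted too recently: stay silent
        self.last_vote_time = now
        return True


def wins_election(acks: int, total_masters: int) -> bool:
    """A slave wins when more than half of the masters acknowledged it."""
    return acks > total_masters // 2


masters = [VotingMaster(node_timeout=15.0) for _ in range(3)]
acks_a = sum(m.handle_auth_request(now=0.0) for m in masters)    # slave A asks first
acks_b = sum(m.handle_auth_request(now=10.0) for m in masters)   # slave B asks too soon
print(wins_election(acks_a, len(masters)), wins_election(acks_b, len(masters)))
```

Because each master stays silent for a while after voting, two slaves that ask in quick succession cannot both collect a majority.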

22. Hash slot assignment and information propagation.

Rule 1: If a hash slot is unassigned, and a known node claims it, I'll modify my hash slot table to associate the hash slot to this node.

Rule 2: If a hash slot is already assigned, and a known node is advertising it using a configEpoch that is greater than the configEpoch advertised by the current owner of the slot, I'll rebind the hash slot to the new node.
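The two rules translate almost directly into code. The function below is a minimal sketch with hypothetical names (`update_slot_table`, plain dictionaries for the slot table and for the last configEpoch advertised by each node); it only models how a receiver reacts to one node's slot claims.

```python
from typing import Dict, Iterable


def update_slot_table(
    table: Dict[int, str],     # hash slot -> node ID of the current owner
    epochs: Dict[str, int],    # node ID -> last configEpoch that node advertised
    sender: str,
    sender_epoch: int,
    claimed_slots: Iterable[int],
) -> None:
    """Apply Rule 1 and Rule 2 to the slots claimed by `sender`."""
    for slot in claimed_slots:
        owner = table.get(slot)
        if owner is None:
            table[slot] = sender  # Rule 1: unassigned slot, accept the claim
        elif owner != sender and sender_epoch > epochs.get(owner, 0):
            table[slot] = sender  # Rule 2: a strictly greater configEpoch wins
    epochs[sender] = sender_epoch


table: Dict[int, str] = {}
epochs: Dict[str, int] = {}
update_slot_table(table, epochs, "A", 1, [5])   # A takes the unassigned slot 5
update_slot_table(table, epochs, "B", 1, [5])   # same epoch: A keeps the slot
update_slot_table(table, epochs, "B", 2, [5])   # greater epoch: slot 5 moves to B
print(table)
```

The strict "greater than" comparison means a claim with an equal epoch never displaces the current owner, so the table only changes when a newer configuration is advertised.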

23. Publish and subscribe: a client can publish or subscribe on any node; the cluster forwards the messages internally to the right nodes.
