Redis Sentinel: Cluster failover solution (reprint)

Source: Internet
Author: User
Tags: failover, redis, cluster

This article was reproduced from: http://shift-alt-ctrl.iteye.com/blog/1884370

The configuration examples in this article, as well as the subscription events triggered during the failover process, are of great reference value.

The Redis Sentinel module has been integrated into Redis as of version 2.4+. Although it has not yet been formally released, it is worth trying out and understanding; in fact, Sentinel is still somewhat complicated.
Sentinel's main purpose is to provide a Redis M-S (master, slaves) cluster with 1) master liveness detection, 2) monitoring of the M-S services in the cluster, and 3) automatic failover with M-S role conversion, which from one angle improves the availability of the Redis cluster.

In general, the smallest M-S unit consists of one master and one slave. When the master fails, Sentinel can automatically promote the slave to master; with the Sentinel component in place, the system administrator no longer has to switch slaves by hand.

Some of Sentinel's design ideas are very similar to ZooKeeper's; in fact, instead of using Sentinel, you could develop a ZK client that monitors Redis and fulfils the same design requirements.

I. Environment deployment

Prepare three Redis services to build a simple, small M-S environment. Apart from the port, their respective redis.conf configuration items must be identical (including AOF/snapshot, memory, rename, authorization password, etc.). The reason is that with Sentinel-based failover, all servers must run with the same mechanics; they differ only in their runtime "role", and those roles may be converted in the event of a failure: a slave may become the master at some point. Although a slave would normally persist via snapshot while the master uses AOF, once Sentinel is introduced both slave and master should use AOF (snapshot backups can still be triggered manually via BGSAVE).

1) redis.conf:

    ## redis.conf
    ## redis-0, defaults to master
    port 6379
    ## authorization password; please keep this configuration consistent across all nodes
    requirepass 012_345^678-
    masterauth 012_345^678-
    ## disable command renaming for now
    ## rename-command
    ## enable AOF, disable snapshots
    appendonly yes
    save ""
    ## slaveof no one
    slave-read-only yes

    ## redis.conf
    ## redis-1, configured as a slave; the config file stays independent
    port 6479
    slaveof 127.0.0.1 6379
    ## ----------- other configuration identical to the master ----------- ##

    ## redis.conf
    ## redis-2, configured as a slave; the config file stays independent
    port 6579
    slaveof 127.0.0.1 6379
    ## ----------- other configuration identical to the master ----------- ##

2) sentinel.conf

    First create a new local-sentinel.conf in the same directory as each Redis service's sentinel.conf, and copy the following configuration into it.

    ## redis-0
    ## communication port between sentinel instances
    port 26379
    sentinel monitor def_master 127.0.0.1 6379 2
    sentinel auth-pass def_master 012_345^678-
    sentinel down-after-milliseconds def_master 30000
    sentinel can-failover def_master yes
    sentinel parallel-syncs def_master 1
    sentinel failover-timeout def_master 900000

    ## redis-1
    port 26479
    ## -------- other configuration as above -------- ##

    ## redis-2
    port 26579
    ## -------- other configuration as above -------- ##

3) Start-up and detection

    ## redis-0 (defaults to master)
    > ./redis-server --include ./redis.conf
    ## start the sentinel component
    > ./redis-sentinel ./local-sentinel.conf
Following the steps above, start redis-0, redis-1, and redis-2 in turn. When redis-1 and redis-2 start, you will see the Sentinel console on redis-0 print "+sentinel ..." messages, indicating that new Sentinel instances have joined the monitoring. One important reminder: the master machine must be started first when building your sentinel environment for the first time.

You can then open a redis-cli window against any of the servers and enter the "INFO" command to see the status of the current server:

    > ./redis-cli -h 127.0.0.1 -p 6379
    ## summary of the printed information:
    # Replication
    role:master
    connected_slaves:2
    slave0:127.0.0.1,6479,online
    slave1:127.0.0.1,6579,online
The "info" command will print the full service information, including the cluster, and we only need to focus on the "Replication" section, which will tell us "the role of the current server" and all the slave information that points to it. You can use the " Info "command to obtain the master information pointed to by the current slave.

The "info" instruction not only helps us to get the situation of the cluster, but the Sentinel component also uses "info" to do the same thing.

When the deployment above is stable, shut down redis-0 directly. After "down-after-milliseconds" has elapsed (30 seconds), the Sentinel windows of redis-0/redis-1/redis-2 will print a series of events such as "+sdown", "+odown", "+failover", "+selected-slave", "+promoted-slave", "+slave-reconf", and so on, indicating that when the master fails, the Sentinel component carries out the failover process.

When the environment stabilizes again, we find that redis-1 has been promoted ("promoted") to master, and redis-2 now follows redis-1, having gone through the "slave-reconf" process.
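These failover events are also published on each Sentinel instance's own pub/sub channels (one channel per event type), so instead of watching consoles you can subscribe to all of them at once; a minimal sketch, assuming the sentinel port from the configuration above:

    > ./redis-cli -h 127.0.0.1 -p 26379 psubscribe '*'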

If you want redis-0 to rejoin the cluster later, you first need to find the current master's IP + port via the "INFO" command, and pass the slaveof parameter explicitly in the startup command:

    > ./redis-server --include ./redis.conf --slaveof 127.0.0.1 6479
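Alternatively, you can ask any live Sentinel instance for the current master's address directly; the "SENTINEL get-master-addr-by-name" subcommand returns the IP and port for a monitored master name (here assuming redis-1 on 6479 was promoted, as in the walkthrough above):

    > ./redis-cli -h 127.0.0.1 -p 26379 sentinel get-master-addr-by-name def_master
    1) "127.0.0.1"
    2) "6479"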

Sentinel instances need to be kept running at all times; if you start a server without starting its corresponding sentinel, the server cannot be properly monitored and managed.

II. Sentinel Principles

First of all, two terms need explaining: SDOWN and ODOWN.

    • SDOWN: subjectively down, literally a "subjective" failure, i.e. the current Sentinel instance considers a given Redis server to be "unavailable".
    • ODOWN: objectively down, literally an "objective" failure, i.e. multiple Sentinel instances consider the master to be in the SDOWN state; the master is then in ODOWN. ODOWN can simply be understood as the master having been identified as "unavailable" by the cluster, at which point failover will begin.

SDOWN applies to both master and slaves, whereas ODOWN is only used for the master. When a slave has been failing for longer than "down-after-milliseconds", all Sentinel instances will mark it as SDOWN.

1) SDOWN and ODOWN conversion process:

    • After each Sentinel instance starts, it establishes TCP connections to the known slaves/master and to the other Sentinels, and periodically sends PING (default once per second).
    • During this interaction, if a redis-server cannot respond, or responds with an error, within the "down-after-milliseconds" window, that redis-server is considered to be in the SDOWN state.
    • If the SDOWN server in the previous step is the master, the Sentinel instance will intermittently (once per second) send the "is-master-down-by-addr <ip> <port>" command to the other Sentinels and collect their responses. Per the configuration item "sentinel monitor <mastername> <masterip> <masterport> <quorum>", if at least <quorum> Sentinel instances report that the master is in SDOWN, the current Sentinel instance marks the master as ODOWN; the other Sentinel instances do the same.
    • Each Sentinel instance intermittently (every 10 seconds) sends the "INFO" command to the master and slaves; if the master has failed and no new master has been elected yet, "INFO" is sent every 1 second instead. The main purpose of "INFO" is to obtain and confirm the liveness of the slaves and the master in the current cluster environment.
    • After the above process, all Sentinels agree that the master has failed and begin the failover (the Sentinel's view of the master can be inspected at any time, as shown below).
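To inspect a Sentinel's current judgment of the master, the "SENTINEL masters" subcommand lists each monitored master along with its flags, which include s_down and o_down when those states have been set:

    > ./redis-cli -h 127.0.0.1 -p 26379 sentinel masters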

2) Sentinel and slaves "autodiscovery" mechanism:

In Sentinel's configuration file (local-sentinel.conf), a port is specified; this is the port on which the Sentinel instance listens for other sentinel instances to establish connections. Once the cluster is stable, each pair of Sentinel instances will eventually have a TCP connection between them, over which "PING" and commands like "is-master-down-by-addr" are sent; these are used to detect the liveness of the other sentinel instances and to exchange information during "ODOWN" confirmation and "failover".
Before connections between Sentinels are established, each Sentinel tries to connect to the master specified in the configuration file. Communication between a Sentinel and the master is primarily based on pub/sub for publishing and receiving information; the published information includes the listening port of the current Sentinel instance:

    +sentinel sentinel 127.0.0.1:26579 127.0.0.1 26579 ....

The topic being published to is named "__sentinel__:hello", and each Sentinel instance also "subscribes" to this topic to learn about the other Sentinel instances. Thus, when the environment is first built, with the default master alive, all Sentinel instances can receive every Sentinel's information through pub/sub, and each Sentinel instance can then establish a point-to-point TCP connection to every other Sentinel based on the "ip+port" in the "+sentinel" information. Note that each Sentinel instance intermittently (every 5 seconds) publishes its own ip+port to the "__sentinel__:hello" topic, so that Sentinel instances joining the cluster later can also obtain this information.
From the above we know that while the master is alive, the slave list of the current master can be obtained via the "INFO" command; then, whenever a slave joins the cluster, a "+slave 127.0.0.1:6579" event is published to the topic, so all Sentinels get the slave information immediately, establish a connection to the slave, and PING it to detect its liveness.

To add: each sentinel instance holds a list of the other Sentinel instances as well as the existing master/slaves list, with no duplicate entries in either list (duplicate TCP connections are not possible). Sentinels are uniquely identified by ip+port, while master/slaves are uniquely identified by runid; note that a redis-server's runid is different on every boot.
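You can observe this discovery traffic yourself by subscribing to the hello channel on the master; a minimal sketch, assuming the master port and password from the configuration above:

    > ./redis-cli -h 127.0.0.1 -p 6379 -a 012_345^678- subscribe __sentinel__:hello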

3) Leader election:

In fact, during Sentinel's failover a "leader" is still required to orchestrate the entire process: the master election plus the reconfiguration and synchronization of slaves. When there are multiple Sentinel instances in a cluster, how is one Sentinel elected leader?

The process is carried out via the "can-failover" and "quorum" parameters in the configuration file, together with the "is-master-down-by-addr" command.

A) "Can-failover" is used to indicate whether the current Sentinel can participate in the "failover" process, and if "YES" indicates that it will be able to participate in the "Leader" election, otherwise it will act as "Observer", Observer participate in leader elections but cannot be elected;

B) "Quorum" is not only used to control the master Odown status confirmation, but also used to elect the minimum "approval" number of leader;

C) "Is-master-down-by-addr", as mentioned above, and it can be used to detect whether "IP + port" Master is already in the Sdown state, but this instruction can not only get Master is in Sdown, It also returns additional leader information (Runid) for the current Sentinel local "poll".

Each Sentinel instance holds information about the other Sentinels. During the leader election process (which takes place when the Sentinel instance acting as leader becomes invalid; note that the master server may not have failed at the same time, so understand the two cases separately), the Sentinel instance removes from the set of all known Sentinels those with "can-failover = no" and those in the SDOWN state, sorts the remaining Sentinels by runid in "dictionary" order, takes the Sentinel instance with the smallest runid and "votes" for it as leader, appending the chosen runid to its response whenever another Sentinel sends the "is-master-down-by-addr" command. Each Sentinel instance then examines the "is-master-down-by-addr" responses it has collected: if the leader voted for is itself, and the number of Sentinels "endorsing" it is no less than (>=) 50% + 1 of the healthy Sentinel instances and also not smaller than <quorum>, then this sentinel considers the election successful, with itself as leader.

In the sentinel.conf file, we should make sure that enough Sentinel instances are configured with "can-failover yes", to guarantee that when the leader fails another Sentinel can be elected leader to carry out the failover. If no leader can be produced, for example because too few sentinel instances are alive, the failover process cannot continue.
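To see which peer Sentinels an instance currently knows about, and which of them could therefore take part in an election, the "SENTINEL sentinels" subcommand lists them for a given master name:

    > ./redis-cli -h 127.0.0.1 -p 26379 sentinel sentinels def_master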

4) Failover Process:

Before the leader triggers the failover it waits a few seconds (randomly 0~5) so that the other sentinel instances can prepare and adjust themselves (there may momentarily be multiple leaders?). If everything is in order, the leader then starts promoting a slave to master. The chosen slave must be in a good state (not in SDOWN/ODOWN) and have the lowest weight value (redis.conf). Once the new master's identity is confirmed, the failover begins:

A) "+failover-triggered": Leader began failover, followed by "+failover-state-wait-start", wait for a few seconds.

B) "+failover-state-select-slave": Leader start looking for the right slave

C) "+selected-slave": a suitable slave has been found

D) "+failover-state-sen-slaveof-noone": Leader sends the "slaveof no one" instruction to Slave, at which point Slave has completed the role conversion, this slave is the master

E) "+failover-state-wait-promotition": Wait for other Sentinel confirmation slave

F) "+promoted-slave": Confirm success

G) "+failover-state-reconf-slaves": Start the reconfig operation on the slaves.

H) "+slave-reconf-sent": Sends the "slaveof" instruction to the specified slave, informing the slave to follow the new master

I) "+slave-reconf-inprog": This slave is performing the slaveof + sync process, +slave-reconf-sent will be performed after slave receives "slaveof".

J) "+slave-reconf-done": This slave completes synchronously, and thereafter leader can continue the reconfig operation of the next slave. Cycle g)

K) "+failover-end": End of failover

L) "+switch-master": After a successful failover, each Sentinel instance starts to monitor the new master.

III. sentinel.conf in detail

  ## communication port between sentinel instances
  ## redis-0
  port 26379
  ## the master that sentinel needs to monitor: <mastername> <masterIP> <masterPort> <quorum>
  ## <quorum> should be smaller than the number of sentinel instances in the cluster; only when at least <quorum> sentinel instances report "master failed"
  ## will the master be considered O_DOWN ("objectively" down)
  sentinel monitor def_master 127.0.0.1 6379 2
  sentinel auth-pass def_master 012_345^678-
  ## the time window after which the master is considered "dead" by the current sentinel instance
  ## if, in direct communication between this sentinel and the master, there is no response or an error code is returned within the specified time,
  ## this sentinel considers the master failed (SDOWN, "subjectively" down)
  ## <mastername> <milliseconds>
  ## default is 30 seconds
  sentinel down-after-milliseconds def_master 30000
  ## whether the current sentinel instance is allowed to carry out a "failover"
  ## "no" means this sentinel is an "observer" (it only votes and does not take part in carrying out the failover);
  ## at least one in the whole cluster must be "yes"
  sentinel can-failover def_master yes
  ## when a new master is produced, the number of slaves that "slaveof" to the new master and "SYNC" at the same time.
  ## default is 1; keeping the default value is recommended.
  ## while a slave executes slaveof and synchronizes, client requests to it will be terminated.
  ## a larger value means the "cluster" cuts off a larger share of client requests at once;
  ## a smaller value means that during failover more slaves keep serving clients with the old data.
  sentinel parallel-syncs def_master 1
  ## failover expiration time: once failover starts, if no failover action is triggered within this time,
  ## the current sentinel will consider this failover failed.
  sentinel failover-timeout def_master 900000
  ## on failover, a "notification" script can be specified to inform the system administrator of the current cluster situation.
  ## the maximum time the script is allowed to run is 60 seconds; on timeout the script is terminated (KILL)
  ## script exit codes:
  ## 1: retry later, maximum retry count is 10;
  ## 2: execution finished, no retry required
  ## sentinel notification-script mymaster /var/redis/notify.sh
  ## reconfigure clients after failover; a large number of parameters are passed when the script is executed, please refer to the relevant documentation
  ## sentinel client-reconfig-script <master-name> <script-path>
For more information, please refer to the src/sentinel.c source code. For the configuration-file loading procedure, see the method sentinelHandleConfiguration(...).
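As an illustration of the "notification-script" hook described above, here is a minimal sketch of a notify.sh; Sentinel invokes the script with the event type and the event description as its two arguments (the log path is a placeholder of our choosing):

    #!/bin/bash
    ## $1 = event type (e.g. +sdown, +switch-master), $2 = event description
    EVENT_TYPE="$1"
    EVENT_DESC="$2"
    ## append the event to a local log so the administrator can review the failover timeline
    echo "$(date '+%F %T') ${EVENT_TYPE} ${EVENT_DESC}" >> /var/redis/sentinel-events.log
    ## exit 0: done, no retry needed (per the exit codes above, 1 asks sentinel to retry, up to 10 times)
    exit 0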
