Redis Series: A Deep Dive into Sentinel Clusters

Source: Internet
Author: User
Tags: failover, redis, redis cluster
I. Preface

Earlier articles in this series covered Redis basics, persistence, and replication; if you have not read them, please see the Redis series. As always, I am sharing what I have learned, so corrections and discussion are welcome. This article introduces Sentinel clusters: deployment, how Sentinel works, the relevant configuration, failover, and more. Redis ships with the Sentinel mechanism, and many companies (including the author's own) run Redis master-slave setups in Sentinel mode.

II. Sentinel Overview

Sentinel is the officially recommended high-availability (HA) solution for Redis. The previous article introduced Redis master-slave replication, whose drawback is that when the master fails, failover must be performed by hand. Sentinel is a standalone process that monitors one or more master-slave groups and fails over automatically when a master goes down. Better still, Sentinel is itself a distributed system whose design resembles ZooKeeper: when a master fails, the Sentinel cluster uses the Raft algorithm to elect a leader, and that leader performs the failover. To reach the Redis master, a client only needs to ask Sentinel, which returns the currently available master; the client therefore does not have to track configuration changes caused by a switchover. A typical Sentinel architecture looks like this:

Sentinel's main features:

    • Monitoring: Sentinel constantly checks whether the master and slaves are working properly.
    • Notification: Sentinel can notify administrators or other applications through an API when a monitored Redis instance has a problem.
    • Automatic failover: when a master stops working, Sentinel starts a failover that promotes one of the failed master's slaves to be the new master and reconfigures the remaining slaves to replicate from it. Clients that ask for the failed master are then given the address of the new master, so the cluster keeps serving with the new master in place of the failed one.
    • Configuration provider: after a failover, Sentinel returns the new master's address to clients.

III. Sentinel Cluster deployment environment planning

This walkthrough deploys three Sentinel nodes to monitor a one-master, two-slave Redis setup. For building the master-slave replication itself, see the author's earlier post "Redis Series: Master-Slave Replication and the Evolution of Redis Replication". The environment plan is as follows:

    • Sentinel node: 10.1.210.32:26379, 10.1.210.33:26379, 10.1.210.34:26379
    • Redis instances: 10.1.210.69:6379 (master), 10.1.210.69:6380 (slave), 10.1.210.69:6381 (slave)
Installation configuration

Sentinel is installed the same way as Redis, so refer to the earlier articles in this series. The Redis source tree ships a reference configuration, sentinel.conf (view the non-comment lines with grep -v "^#" sentinel.conf), which looks like this:

port 26379
sentinel monitor mymaster 127.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000

Configuration Description:

sentinel monitor mymaster 127.0.0.1 6379 2

This line tells Sentinel to monitor a master named mymaster (the name is up to you) at address 127.0.0.1, port 6379. The trailing 2 means the master is considered truly unavailable only when at least 2 Sentinels in the cluster judge it to have failed. Officially this parameter is called the quorum; it is also used when electing the lead Sentinel, as described below.

sentinel down-after-milliseconds mymaster 30000

Sentinel periodically pings the master to confirm it is alive. If the master fails to reply with PONG within a certain window, or replies with an error, this Sentinel subjectively (unilaterally) considers the master unavailable (subjectively down, abbreviated SDOWN). down-after-milliseconds specifies that window in milliseconds; in this example, a master that does not answer PING for 30 seconds is marked subjectively down.

sentinel parallel-syncs mymaster 1

This option controls how many slaves may resynchronize with the new master at the same time during a failover. The smaller the number, the longer the full failover takes to complete; the larger the number, the more slaves are simultaneously unable to serve requests while they replicate. Setting this value to 1 guarantees that only one slave at a time is in a state where it cannot handle command requests.
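To make the trade-off concrete, here is a small illustrative sketch (the function name and round model are invented for illustration, not Redis internals): with N slaves and parallel-syncs = P, reconfiguration proceeds in roughly ceil(N / P) rounds, with up to P slaves resynchronizing (and potentially unavailable) per round.

```python
import math

def sync_rounds(num_slaves, parallel_syncs):
    # Rough model: each round resynchronizes up to `parallel_syncs` slaves.
    return math.ceil(num_slaves / parallel_syncs)

print(sync_rounds(4, 1))  # 4 rounds, but only 1 slave unavailable at a time
print(sync_rounds(4, 4))  # 1 round, but all 4 slaves may be unavailable at once
```

A low parallel-syncs value therefore trades failover duration for read availability during the switch.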

sentinel failover-timeout mymaster 180000

failover-timeout is the failover timeout, in milliseconds: once a failover has started, if it does not complete within this time, the current Sentinel considers the failover to have failed.

Notice that Sentinel options all follow a fixed format:

sentinel <option_name> <master_name> <option_value>
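As a sanity check of this format, here is a minimal, hypothetical Python parser for such lines (not part of Redis; the function name and dict layout are invented for illustration). Note that the monitor directive is the one exception, carrying two extra fields:

```python
def parse_sentinel_options(lines):
    """Parse 'sentinel <option> <master_name> <value>' lines into a per-master dict."""
    options = {}
    for line in lines:
        parts = line.split()
        # skip blanks, comments, and non-"sentinel" directives (port, dir, ...)
        if len(parts) < 4 or parts[0].lower() != "sentinel":
            continue
        if parts[1].lower() == "monitor":
            # special case: sentinel monitor <name> <ip> <port> <quorum>
            name, ip, port, quorum = parts[2], parts[3], int(parts[4]), int(parts[5])
            options.setdefault(name, {})["monitor"] = (ip, port, quorum)
        else:
            option, name, value = parts[1].lower(), parts[2], parts[3]
            options.setdefault(name, {})[option] = int(value)
    return options

conf = [
    "port 26379",
    "sentinel monitor mymaster 127.0.0.1 6379 2",
    "sentinel down-after-milliseconds mymaster 30000",
    "sentinel parallel-syncs mymaster 1",
    "sentinel failover-timeout mymaster 180000",
]
print(parse_sentinel_options(conf)["mymaster"]["monitor"])  # ('127.0.0.1', 6379, 2)
```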

Running Sentinel

There are two ways to start Sentinel, both of which must specify a configuration file:

# Method 1 (recommended)
redis-sentinel /path/to/sentinel.conf
# Method 2
redis-server /path/to/sentinel.conf --sentinel

Below are the author's configuration files for the three nodes, copied to each node before starting:

# 10.1.210.32
bind 10.1.210.32                      # bind IP address
port 26379                            # port
dir /opt/db/redis                     # data directory
daemonize yes                         # run in the background
logfile /opt/db/redis/sentinel.log    # log file
sentinel monitor mymaster 10.1.210.69 6379 2
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000

# 10.1.210.33
bind 10.1.210.33                      # bind IP address
port 26379                            # port
dir /opt/db/redis                     # data directory
daemonize yes                         # run in the background
logfile /opt/db/redis/sentinel.log    # log file
sentinel monitor mymaster 10.1.210.69 6379 2
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000

# 10.1.210.34
bind 10.1.210.34                      # bind IP address
port 26379                            # port
dir /opt/db/redis                     # data directory
daemonize yes                         # run in the background
logfile /opt/db/redis/sentinel.log    # log file
sentinel monitor mymaster 10.1.210.69 6379 2
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000

Start each Sentinel with redis-sentinel /opt/db/redis/sentinel.conf. The startup log below shows that Sentinel automatically discovers the slaves and the other Sentinels through the master:

At this point, the Sentinel cluster is built and running.

Sentinel Related commands

Like Redis, Sentinel accepts commands from clients; for example, SENTINEL masters shows the state of the monitored masters:

Here are all the commands and explanations:

SENTINEL masters                                # list all monitored masters and their current state
SENTINEL master <master name>                   # show the state of the specified master
SENTINEL slaves <master name>                   # list all slaves of the given master and their state
SENTINEL sentinels <master name>                # list all Sentinels monitoring the given master
SENTINEL get-master-addr-by-name <master name>  # return the IP address and port of the master with this name
SENTINEL reset <pattern>                        # reset the state of all masters matching the pattern
SENTINEL failover <master name>                 # when the master has failed, force a failover without asking the other Sentinels for agreement; the latest configuration is still sent to the other Sentinels, which update from it
SENTINEL ckquorum <master name>                 # check whether the current Sentinel configuration can reach the quorum needed to fail over the master; useful to verify the deployment, returns OK when healthy
SENTINEL flushconfig                            # force Sentinel to rewrite its runtime configuration to disk, including the current Sentinel state
IV. Sentinel Principles

Sdown and Odown

Before you introduce the Sentinel principle, you need to understand the two concepts Sdown and Odown:

    • SDOWN: short for subjectively down, the subjective offline state — a single Sentinel's own judgment that a Redis instance is down.
    • ODOWN: short for objectively down, the objective offline state — reached when enough Sentinel instances have made the SDOWN judgment about the same Redis instance and have confirmed it with each other, yielding the verdict that the instance is offline. (One Sentinel can ask another whether it considers a given Redis instance down by sending it the SENTINEL is-master-down-by-addr command.)

From a single Sentinel's point of view, the SDOWN condition is reached when its PING heartbeats receive no valid reply for a certain amount of time. That time is set by the down-after-milliseconds parameter in the configuration.

When Sentinel sends a PING, any one of the following replies is considered valid:

    • +PONG
    • -LOADING error
    • -MASTERDOWN error

Any other reply (or no reply at all) is invalid.

Going from SDOWN to ODOWN requires no strong consensus algorithm, only a gossip protocol: if a Sentinel hears from enough other Sentinels that a master is down, its SDOWN becomes ODOWN. If the master becomes reachable again later, the state is cleared accordingly.
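The SDOWN-to-ODOWN rule can be sketched as a simple vote count (a simulation for illustration, not Sentinel's actual implementation — the real vote gathering happens via SENTINEL is-master-down-by-addr):

```python
def is_odown(sdown_votes, quorum):
    """sdown_votes: per-Sentinel booleans, True = that Sentinel sees SDOWN.
    The master becomes ODOWN once at least `quorum` Sentinels report SDOWN."""
    return sum(sdown_votes) >= quorum

votes = [True, True, False]       # 2 of 3 Sentinels flag the master as SDOWN
print(is_odown(votes, quorum=2))  # True: quorum reached, master is ODOWN
```

With the example configuration (quorum = 2), two agreeing Sentinels are enough to declare the master objectively down.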

A real failover requires an authorization step, which is the leader election process, but every failover starts from an ODOWN state. The ODOWN state applies only to masters: for Redis nodes that are not masters, Sentinels do not need to negotiate, so slaves and Sentinels themselves never enter the ODOWN state.

Implementation principle

When a Sentinel starts, it reads its configuration file and finds the master to monitor from the sentinel monitor <master-name> <ip> <port> <quorum> directive described earlier. master-name may consist of letters, digits, and the characters ". _-"; because the master's IP address and port may change after a failover, the name rather than the address is what identifies the master. One Sentinel can monitor multiple master-slave systems, forming a mesh structure, as illustrated in the earlier introduction.

On startup, Sentinel opens two connections to each monitored database. For example, here the Sentinel node 10.1.210.32 holds two connections to the master 10.1.210.69:6379:

Both connections are ordinary client connections. One subscribes to the master's __sentinel__:hello channel to learn about other Sentinels monitoring the same database; the other is used to periodically send commands such as INFO to the master to fetch its state. Two connections are needed because a client in subscribe mode can only receive messages and cannot send other commands.

Once connected to the monitored master, Sentinel performs three periodic tasks:

    1. Every 10 s, send the INFO command to the master and slaves;
    2. Every 1 s, send PING to the master, slaves, and the other Sentinel nodes;
    3. Every 2 s, publish its own information to the master's and slaves' __sentinel__:hello channel to announce its presence; this is also the basis of automatic discovery between Sentinels.

These three tasks run throughout a Sentinel's entire life cycle and are the core of how it works, so each is described in detail below.

First, at startup, sending INFO to the master gives Sentinel the master's current state, including its run ID, replication information, and the slave nodes attached to it. This is why only the master needs to be configured for monitoring: Sentinel automatically discovers the corresponding slaves and monitors them too. It then opens the same pair of connections to each slave, identical to the pair described above for the master. From then on, Sentinel sends INFO to every known master and slave every 10 seconds, refreshing its view and acting accordingly, for example adding a newly appeared slave to the monitoring list, or updating master information when it changes (as after a failover).

Next, Sentinel publishes messages to the master's and slaves' __sentinel__:hello channel to share its own information with the other Sentinels monitoring the same Redis instances. Each message contains:

<Sentinel address>,<Sentinel port>,<Sentinel run ID>,<Sentinel configuration epoch>,<master name>,<master address>,<master port>,<master configuration epoch>

The message thus carries the Sentinel's basic information plus that of the monitored master. When another Sentinel receives it, it checks whether the sender is a Sentinel it has not seen before; if so, it adds the sender to its list of known Sentinels and opens a connection to it (unlike with databases, Sentinels open only a single connection to each other, used for the PING command). It also compares the master configuration version in the message with its own record and, if the incoming version is higher, updates its master data; the role of this version number is explained later.
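Since the hello payload is a flat comma-separated string, a minimal parser is easy to sketch (the function name, dict layout, and sample values are invented for illustration; field names follow the description above):

```python
def parse_hello(payload):
    """Split a __sentinel__:hello payload into sentinel and master parts."""
    (s_ip, s_port, s_runid, s_epoch,
     m_name, m_ip, m_port, m_epoch) = payload.split(",")
    return {
        "sentinel": (s_ip, int(s_port), s_runid, int(s_epoch)),
        "master": (m_name, m_ip, int(m_port), int(m_epoch)),
    }

# hypothetical sample message from the deployment in this article
msg = "10.1.210.32,26379,0123abcd,5,mymaster,10.1.210.69,6379,5"
info = parse_hello(msg)
print(info["master"][0])  # mymaster
```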

Having auto-discovered the slaves and the other Sentinel nodes, Sentinel's remaining task is to periodically check whether all of the discovered master, slave, and Sentinel nodes are still online. It does this by sending PING at regular intervals; the threshold is set by down-after-milliseconds. If a database or Sentinel has not answered PING within that time, the Sentinel marks it subjectively down (SDOWN, described above). If the node is a master, the Sentinel goes further to decide whether failover is needed: it sends SENTINEL is-master-down-by-addr to ask whether the other Sentinel nodes also consider the master subjectively down, and if the configured number agree (2 in the example configuration), it marks the master objectively down (ODOWN); a lead Sentinel (leader) is then elected to carry out the recovery. The election process is described later.

Once elected, the lead Sentinel begins recovering the failed master, a process known as failover. When selecting the new master, Sentinel considers:

    • How long the slave's connection with the master has been broken
    • The slave's priority (set by the slave-priority option)
    • The replication offset
    • The run ID of the instance

The specific selection sequence is as follows:

A slave whose link to the master has been down for more than 10 times down-after-milliseconds, plus the length of time the master has been unreachable, is considered unsuitable for election as master. The formula is:

(down-after-milliseconds * 10) + milliseconds_since_master_is_in_SDOWN_state

The remaining slaves are then sorted:

    1. By slave priority: the lower the slave-priority value, the higher the priority;
    2. If priorities are equal, by replication offset: the larger the offset, the closer the slave's data is to the old master's, and the higher its precedence;
    3. If both of the above are equal, the slave with the smallest run ID is chosen.

Once a slave is selected, the lead Sentinel sends it the SLAVEOF NO ONE command to promote it to the new master, then sends SLAVEOF commands to the remaining slaves to point them at the new master, and finally updates its internal records so that the failed old master, when it recovers, automatically becomes a slave of the new master and continues serving in that role. From startup through a completed recovery, that is the Sentinel's full working process and the essence of how it operates.

Lead Sentinel election

As mentioned in the principles section, when Sentinels detect that a master is objectively down, they elect a lead Sentinel to perform the recovery. The election uses the Raft algorithm, which is why Sentinel's design is said to resemble ZooKeeper. The process is roughly as follows:

    1. The Sentinel node that detects the master's objective downline (call it A) sends a command to every other Sentinel node asking to be elected lead Sentinel (leader);
    2. If the target Sentinel has not already voted for someone else, it agrees to elect A as the lead Sentinel;
    3. If A receives votes from more than half of the Sentinel nodes, and no fewer than the quorum value, A is successfully elected lead Sentinel;
    4. If several Sentinel nodes run for leader at once and none is elected, each candidate waits a random amount of time before the next round of voting, until a lead Sentinel is chosen.

Configure version number function

Recall the master configuration version number mentioned in the principles section: when a Sentinel is authorized to fail over a downed master, it obtains a new configuration version number, and when the failover finishes, that version marks the latest configuration. Because the majority of Sentinels know this version has been claimed by the Sentinel executing the failover, no other Sentinel can reuse it; every failover is therefore tagged with a unique version number. We will see below why this matters.

Sentinel clusters also follow a rule: if Sentinel A has authorized Sentinel B to execute a failover, A will wait for a period of time before attempting a failover of the same master itself. This wait is configured through the failover-timeout option. It follows that the Sentinels in a cluster never fail over the same master concurrently: if the first Sentinel's failover fails, another Sentinel retries after the timeout, and so on.
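This waiting rule can be sketched as a timestamp check (a simplified model following the article's description; the actual Redis implementation applies a related but more detailed rule around failover-timeout):

```python
def can_retry_failover(now_ms, last_attempt_ms, failover_timeout_ms=180000):
    # A Sentinel that saw a failover attempt on this master waits
    # failover-timeout before trying the same master again itself.
    return now_ms - last_attempt_ms >= failover_timeout_ms

print(can_retry_failover(now_ms=190_000, last_attempt_ms=0))  # True: timeout elapsed
print(can_retry_failover(now_ms=100_000, last_attempt_ms=0))  # False: still waiting
```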

Sentinel guarantees liveness: if a majority of Sentinels can communicate with each other, one of them will eventually be authorized to failover.
Sentinel also guarantees safety: every Sentinel that tries to failover the same master does so with a distinct version number.

V. Sentinel State Persistence

Sentinel's state is persisted to its configuration file. Whenever a new configuration is received or created, it is written to disk together with its configuration version stamp. This means the Sentinel process can be safely stopped and restarted. The following is the rewritten configuration file:

VII. Configuration Propagation

Once a Sentinel has successfully failed over a master, it notifies the other Sentinels of the latest configuration for that master, and they update their records accordingly. For a failover to be considered successful, the Sentinel must be able to send SLAVEOF NO ONE to the slave chosen as the new master, and then observe the new master's configuration via the INFO command.

Once the chosen slave has been sent SLAVEOF NO ONE and promoted to master, the failover is considered successful even if the other slaves have not yet been reconfigured to follow the new master; at that point all Sentinels publish the new configuration.

This propagation of new configurations through the cluster is why a Sentinel must be granted a version number before it may perform a failover.

Every Sentinel continually propagates its master configuration version information via publish/subscribe, using the __sentinel__:hello channel for propagation; we can watch the traffic by subscribing to that channel, as follows:

Because every configuration carries a version number, the one with the highest version number wins. For example, suppose the master named mymaster initially lives at 10.1.210.69:6379. At first, every Sentinel in the cluster knows this address, so the configuration for mymaster has version number 1. Some time later mymaster dies, and one Sentinel is authorized to fail it over with version number 2. If the failover succeeds and the address becomes, say, 10.1.210.69:6380, the failover Sentinel broadcasts the new configuration with version 2; the other Sentinels, still holding version 1, see that the incoming version number is larger, update their configuration, and thus adopt the version-2 configuration as the latest.
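The version rule reduces to a single comparison, sketched here with the example's values (the dict layout and function name are illustrative, not Sentinel internals):

```python
def maybe_update(local, incoming):
    """Adopt the incoming master config only if its version is strictly higher."""
    return incoming if incoming["version"] > local["version"] else local

local = {"addr": ("10.1.210.69", 6379), "version": 1}    # pre-failover config
incoming = {"addr": ("10.1.210.69", 6380), "version": 2}  # broadcast after failover
current = maybe_update(local, incoming)
print(current["addr"])  # ('10.1.210.69', 6380): the higher version wins
```

A stale rebroadcast with an older version would simply be ignored, which is what makes the propagation safe.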

VIII. Concluding Remarks

Starting from the basics of Redis, through Sentinel mode, to everyday use of its APIs, this period of studying Redis has given me a certain depth of understanding, so I am recording the process and sharing it, in the hope that more people will not only know how to use Redis but also understand its principles well enough to locate problems when things go wrong. There may of course be mistakes in this article, and corrections are welcome; after all, this is not a source-code-level understanding. The remaining topic to research is Redis Cluster, which will be introduced in a follow-up article once studied, completing a more comprehensive picture of Redis.
