A detailed description of the Redis Sentinel mechanism

Overview

Redis Sentinel is the officially recommended high-availability (HA) solution for Redis. When Redis is used in a master-slave setup, Redis itself (and many of its clients) does not perform automatic master/standby switching when the master goes down. Redis Sentinel is a separate process that can monitor multiple master-slave clusters, detect that a master is down, and carry out the failover on its own.

Its main functions are the following:

Periodically monitor whether Redis works as expected;

If a Redis node is found to be misbehaving, it can notify other processes (such as its clients);

Switching automatically: when a master node becomes unavailable, Sentinel elects one of the master's slaves (if there is more than one) as the new master, and the remaining slaves are reconfigured to replicate from the newly promoted master.
Sentinel Supports Clustering

Obviously, using only a single Sentinel process to monitor a Redis cluster is unreliable: if that Sentinel process goes down (Sentinel itself then being a single point of failure), the whole system can no longer fail over as expected. It is therefore necessary to run a cluster of Sentinels, which has several benefits:

Even if some Sentinel processes go down, master/standby switching of the Redis cluster remains possible;

With only one Sentinel process, a crash of that process, or a network partition around it, would prevent any failover of the Redis cluster (the single-point problem);

With multiple Sentinels, Redis clients are free to connect to any of them to obtain information about the Redis cluster.

Sentinel Version

The current stable version of Sentinel is known as Sentinel 2 (to distinguish it from the earlier Sentinel 1). It ships with the Redis 2.8 release: after building Redis 2.8, you can find the redis-sentinel startup program in redis2.8/src/.

It is strongly recommended that, if you are using Redis 2.6 (whose Sentinel is Sentinel 1), you move to the Redis 2.8 version of Sentinel 2: Sentinel 1 has many bugs and has been officially deprecated.

Running Sentinel

There are two ways to run Sentinel:

First Kind

redis-sentinel /path/to/sentinel.conf

The second Kind

redis-server /path/to/sentinel.conf --sentinel

In both cases you must specify a Sentinel configuration file, sentinel.conf; Sentinel will not start without one. Sentinel listens on port 26379 by default, so before running it, make sure that port is not occupied by another process.

Sentinel's Configuration

The Redis source package contains a sentinel.conf file that serves as the Sentinel configuration file and includes an explanation of each configuration item. Typical configuration items look like this:

sentinel monitor mymaster 127.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 60000
sentinel failover-timeout mymaster 180000
sentinel parallel-syncs mymaster 1

sentinel monitor resque 192.168.1.3 6380 4
sentinel down-after-milliseconds resque 10000
sentinel failover-timeout resque 180000
sentinel parallel-syncs resque 5

The configuration above monitors two masters, named mymaster and resque. Only the masters' information needs to be configured; slave information is unnecessary, because slaves can be discovered automatically (the master carries information about its slaves). Note that Sentinel rewrites the configuration file dynamically while it runs: for example, when a failover occurs, the master recorded in the configuration file is replaced by one of the slaves. This lets Sentinel restore the state of the monitored Redis cluster from this file when it restarts.

Next, we will explain the configuration items above line by line:

sentinel monitor mymaster 127.0.0.1 6379 2

This line says that Sentinel monitors a master named mymaster at address 127.0.0.1:6379. What does the trailing 2 mean? Networks are unreliable: a single Sentinel might mistakenly declare a master down simply because of a network hiccup. With a cluster of Sentinels the problem becomes tractable, because multiple Sentinels can communicate with one another to confirm that a master is really dead. The 2 means that only when 2 Sentinels in the cluster believe the master is down is the master really considered unavailable. (The Sentinels in a cluster also communicate with one another via a gossip protocol.)

Apart from the first line, the remaining configuration items share a uniform format:

sentinel <option_name> <master_name> <option_value>

Next, we explain these configuration items by their option_name:

down-after-milliseconds
Sentinel sends a heartbeat PING to the master to confirm it is alive. If the master does not reply with a valid PONG within a certain time window, or replies with an error message, this Sentinel subjectively (unilaterally) considers the master unavailable (subjectively down, abbreviated SDOWN). down-after-milliseconds specifies that time window, in milliseconds.

Note, however, that Sentinel does not fail over immediately at that point. It first consults the other Sentinels in the cluster: only if at least a certain number of them also subjectively consider the master down is the master judged objectively down (note: objectively, not subjectively; ODOWN, as opposed to the subjective SDOWN above). The number of Sentinels that must agree is the quorum configured in the monitor line shown earlier.
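The SDOWN and ODOWN checks can be sketched as follows. This is a simplified model of the logic, not Sentinel's actual implementation; the constants mirror the sample configuration for mymaster above, and all names are illustrative:

```python
DOWN_AFTER_MS = 60000  # down-after-milliseconds for mymaster
QUORUM = 2             # the trailing 2 in "sentinel monitor mymaster ... 2"

def is_sdown(last_valid_reply_ms, now_ms):
    """Subjectively down: no valid PING reply within down-after-milliseconds."""
    return now_ms - last_valid_reply_ms > DOWN_AFTER_MS

def is_odown(i_think_down, peers_think_down):
    """Objectively down: this Sentinel sees SDOWN, and the total number of
    agreeing Sentinels (itself included) reaches the quorum."""
    votes = int(i_think_down) + sum(peers_think_down)
    return i_think_down and votes >= QUORUM
```

For example, one agreeing peer plus this Sentinel's own SDOWN already reaches the quorum of 2, while peer agreement alone (without local SDOWN) never does.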

parallel-syncs
During a failover, this option specifies how many slaves may synchronize with the new master at the same time. The smaller the number, the longer the failover takes to complete; but the larger the number, the more slaves are simultaneously unavailable because they are busy replicating. Setting this value to 1 ensures that only one slave at a time is in a state where it cannot serve command requests.
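The trade-off can be illustrated with a small sketch. This is only an illustration of the batching effect; Sentinel tracks this state incrementally rather than planning waves up front:

```python
def resync_batches(slaves, parallel_syncs):
    """Group slaves into waves so that at most parallel-syncs of them
    reconfigure and resync against the new master at the same time."""
    return [slaves[i:i + parallel_syncs]
            for i in range(0, len(slaves), parallel_syncs)]
```

With parallel-syncs set to 1, three slaves resync in three successive waves; set to 2, the same slaves resync in two waves, but two of them are unavailable at once.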

Other configuration items are explained in detail in sentinel.conf.
All of them can be modified dynamically at run time with the SENTINEL SET command.

Sentinel's "Arbitration Meeting"

As mentioned earlier, when a master is monitored by a Sentinel cluster, you must specify a parameter for it: the number of Sentinels that must agree before the master is judged unavailable and a failover is attempted. In this article we will call that parameter the number of votes (the quorum).

However, when a failover is actually triggered, it cannot proceed immediately: the Sentinel performing the failover must first obtain the authorization of a majority of the Sentinels in the cluster.

Failover is triggered when the master reaches ODOWN. Once triggered, the Sentinel attempting the failover tries to obtain authorization from a majority of Sentinels (or from more than a majority, if the quorum is set larger than the majority).

The difference looks subtle but is easy to understand with an example. Suppose the cluster has 5 Sentinels and the quorum is set to 2. When 2 Sentinels consider a master unavailable, the failover is triggered, but the Sentinel carrying it out must obtain authorization from at least 3 Sentinels before the failover can actually be performed.

If the quorum is set to 5, all 5 Sentinels must consider the master unavailable before ODOWN is reached, and all 5 must grant authorization before the failover can proceed.

Configuration Version Number

Why is majority authorization really needed before a failover can be executed?

When a Sentinel is authorized, it obtains a unique, up-to-date configuration version number for the failed master; this version number will stamp the new configuration once the failover completes. Because a majority of Sentinels know that this version number has been claimed by the Sentinel executing the failover, no other Sentinel can use it again. This means every failover is accompanied by a unique version number. We will see why this matters.

Sentinel clusters also follow a rule: if Sentinel A has authorized Sentinel B to perform a failover and B's failover times out, A waits a while and then attempts the failover on the same master itself; the waiting time is set by the failover-timeout configuration. From this rule it follows that the Sentinels in a cluster never fail over the same master at the same time: if the first Sentinel to attempt the failover fails, another will retry after a certain time, and so on.
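The retry rule above can be sketched in a few lines. This is a minimal illustration under assumed names and timestamps, not Sentinel's actual scheduling code:

```python
FAILOVER_TIMEOUT_MS = 180000  # failover-timeout from the sample configuration

def may_retry_failover(last_attempt_ms, now_ms):
    """A Sentinel attempts a new failover on the same master only after
    the previous attempt is older than failover-timeout, so two Sentinels
    never run a failover for the same master simultaneously."""
    return now_ms - last_attempt_ms > FAILOVER_TIMEOUT_MS
```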

Redis Sentinel guarantees liveness: if a majority of Sentinels can communicate with one another, one of them will eventually be authorized to fail over.
Redis Sentinel also guarantees safety: every Sentinel that attempts to fail over the same master does so with a unique version number.

Configuration Propagation

Once a Sentinel has successfully failed over a master, it broadcasts the latest configuration for that master to the other Sentinels, which update their own configuration accordingly.

For a failover to be considered successful, the Sentinel must be able to send the SLAVEOF NO ONE command to the slave selected as the new master, and subsequently see the new master's configuration via the INFO command.

Once a slave has been elected master and sent SLAVEOF NO ONE, the failover is considered successful even if the other slaves have not yet reconfigured themselves for the new master; all Sentinels then publish the new configuration.

This propagation of the new configuration through the cluster is the reason a Sentinel must be granted a unique version number before it may fail over.

Each Sentinel continuously broadcasts its master configuration, together with its version number, using publish/subscribe; the publish/subscribe channel used for this propagation is __sentinel__:hello.

Because each configuration carries a version number, the configuration with the largest version number wins.

Here is an example. Suppose there is a master named mymaster at address 192.168.1.50:6379. Initially every Sentinel in the cluster knows this address, so the configuration for mymaster has version number 1. Some time later mymaster dies, and one Sentinel is authorized to fail it over with version number 2. If the failover succeeds and the address becomes, say, 192.168.1.50:9000, the configuration's version number is now 2, and the Sentinel that performed the failover broadcasts the new configuration to the others. Since each of the other Sentinels holds version number 1 and sees that the incoming configuration's version number 2 is larger, it updates its configuration, and the version-2 configuration becomes current throughout the cluster.
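The "largest version number wins" rule from this example can be sketched in a few lines. This is a simplified model; the dictionary layout is an assumption for illustration, not Sentinel's wire format:

```python
def apply_config(current, incoming):
    """Keep whichever master configuration carries the larger version number."""
    return incoming if incoming["version"] > current["version"] else current

# The scenario above: mymaster moves from port 6379 (version 1) to 9000 (version 2).
old = {"name": "mymaster", "addr": ("192.168.1.50", 6379), "version": 1}
new = {"name": "mymaster", "addr": ("192.168.1.50", 9000), "version": 2}
```

Whichever order the two configurations meet in, version 2 survives, so all Sentinels converge on the same answer.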

This means the Sentinel cluster guarantees a second liveness property: a set of Sentinels able to communicate with one another will eventually converge on the configuration with the highest version number.

More Details on SDOWN and ODOWN

Sentinel has two different views of unavailability: subjective unavailability (SDOWN) and objective unavailability (ODOWN). SDOWN is a single Sentinel's own assessment of the master's state; ODOWN requires a certain number of Sentinels to agree that a master is down. Sentinels exchange their assessments of a master via the SENTINEL is-master-down-by-addr command.

From a single Sentinel's point of view, the SDOWN condition is reached when a PING heartbeat goes unanswered by a valid reply for a certain length of time. That length of time is configured via the down-after-milliseconds parameter.

When Sentinel sends a PING, only the following replies are considered valid:

PING replied to with +PONG.
PING replied to with a -LOADING error.
PING replied to with a -MASTERDOWN error.

Any other reply (or no reply at all) is invalid.
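The reply check amounts to a membership test. This is only a sketch: real Sentinel parses the Redis protocol rather than comparing reply strings:

```python
def is_valid_ping_reply(reply):
    """Only +PONG, -LOADING and -MASTERDOWN count as valid PING replies;
    anything else, or no reply at all, pushes the node toward SDOWN."""
    return reply in ("+PONG", "-LOADING", "-MASTERDOWN")
```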

Escalating from SDOWN to ODOWN requires no consensus algorithm, only gossip: if a Sentinel hears from enough other Sentinels that a master is down, its SDOWN flag is promoted to ODOWN. If the master later becomes reachable again, the flag is cleared accordingly.

As explained earlier, an actual failover additionally requires an authorization process, but every failover begins from the ODOWN state.

The ODOWN state applies only to masters. For Redis nodes that are not masters, no negotiation between Sentinels is needed, so slaves and Sentinels themselves never enter the ODOWN state.

Automatic Discovery Mechanism Between Sentinels and Slaves

Although the Sentinels in a cluster connect to one another to check each other's availability and exchange messages, you do not need to configure the other Sentinel nodes on any Sentinel, because Sentinel uses the master's publish/subscribe mechanism to automatically discover the other Sentinels monitoring the same master.

This is implemented by sending messages to a channel named __sentinel__:hello.

Similarly, you do not need to configure a master's slave addresses in Sentinel; Sentinel obtains them by querying the master.

Each Sentinel announces its presence once per second by publishing a message to the __sentinel__:hello channel of every master and slave it monitors.
Each Sentinel also subscribes to the __sentinel__:hello channel of every master and slave to discover previously unknown Sentinels; when a new Sentinel is detected, it is added to the list of Sentinels maintained for that master.
Each such message also carries the sender's latest configuration for the master. If a Sentinel finds that its own configuration version is lower than the one it receives, it updates its master configuration with the newer one.
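A hello message is a small comma-separated payload; a sketch of parsing it might look like the following. Treat the exact field order as an assumption drawn from the Sentinel implementation, not something this article specifies:

```python
def parse_hello(payload):
    """Split a __sentinel__:hello payload into sentinel and master info.
    Assumed field order: sentinel ip, port, run id, current epoch,
    master name, master ip, master port, master config epoch."""
    (s_ip, s_port, s_runid, s_epoch,
     m_name, m_ip, m_port, m_epoch) = payload.split(",")
    return {
        "sentinel": {"ip": s_ip, "port": int(s_port), "runid": s_runid,
                     "epoch": int(s_epoch)},
        "master": {"name": m_name, "ip": m_ip, "port": int(m_port),
                   "config_epoch": int(m_epoch)},
    }
```

A receiving Sentinel would use the sentinel part for discovery and compare the master config epoch against its own version, as described above.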

Before adding a new Sentinel for a master, a Sentinel always checks whether it already knows a Sentinel with the same run ID or the same address; if so, the old entry is deleted and the new Sentinel is added.

Consistency Under Network Isolation

The consistency model of a Redis Sentinel cluster's configuration is eventual: every Sentinel in the cluster ends up with the highest-versioned configuration. In a real deployment, however, three kinds of roles interact with Sentinel:

Redis instance.

Sentinel instance.

Client.

To examine the behavior of the system as a whole, we must take into account these three roles.

Here's a simple example, with three hosts, each running a Redis and a sentinel:

             +-------------+
             | Sentinel 1  | <--- Client A
             | Redis 1 (M) |
             +-------------+
                    |
                    |
 +-------------+    |                     +-------------+
 | Sentinel 2  |----+----/ partition /----| Sentinel 3  | <--- Client B
 | Redis 2 (S) |                          | Redis 3 (M) |
 +-------------+                          +-------------+

In this system, Redis 3 is initially the master and Redis 1 and Redis 2 are slaves. When the host running Redis 3 becomes unreachable, Sentinel 1 and Sentinel 2 start a failover and elect Redis 1 as the new master.

The properties of the Sentinel cluster guarantee that Sentinel 1 and Sentinel 2 now hold the latest configuration for the master. Sentinel 3, however, still holds the old configuration, because it is isolated from the rest.

When the network recovers, we know Sentinel 3 will update its configuration. But what happens during the partition if a client is connected to the master that was cut off?

The client can still write data to Redis 3 during the partition. But when the network recovers, Redis 3 becomes a slave of Redis 1, and all data written to Redis 3 during the isolation is lost.

Whether this scenario is acceptable depends on how you use Redis:

If you use Redis as a cache, you may be able to tolerate the loss of this part of the data.

But if you use Redis as a storage system, you may not be able to tolerate the loss of this part of the data.

Because Redis uses asynchronous replication, there is no way to avoid data loss entirely in such a scenario. However, you can apply the following configuration to Redis 3 and Redis 1 to limit how much data is lost:

min-slaves-to-write 1
min-slaves-max-lag 10
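The effect of these two settings on the master's write path can be sketched as follows. This is a simplified model of the decision, not the Redis implementation; all names and values are illustrative:

```python
MIN_SLAVES_TO_WRITE = 1  # min-slaves-to-write
MIN_SLAVES_MAX_LAG = 10  # min-slaves-max-lag, in seconds

def accept_writes(slave_lags):
    """The master accepts writes only while at least min-slaves-to-write
    slaves have acknowledged replication within min-slaves-max-lag seconds.
    slave_lags holds each connected slave's last-ack lag in seconds."""
    healthy = sum(1 for lag in slave_lags if lag <= MIN_SLAVES_MAX_LAG)
    return healthy >= MIN_SLAVES_TO_WRITE
```

An isolated Redis 3 sees no healthy slaves at all, so it starts rejecting writes instead of silently accepting data that would later be lost.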

With the above configuration, a Redis instance acting as master refuses client write requests whenever it cannot write data to at least the configured number of slaves (min-slaves-to-write). Because replication is asynchronous, "cannot write" here means that a slave is disconnected, or has not sent an acknowledgement of synchronized data within the time specified by min-slaves-max-lag.

Sentinel State Persistence

Sentinel's state is persisted to its configuration file. Every time a new configuration is received or created, it is written to disk together with its version stamp. This means the Sentinel process can be stopped and restarted safely.
