Redis, the in-memory store, is probably the most widely used key-value database in web development. We often use it to store user login state (session storage), to speed up queries on hot data (an order of magnitude faster than MySQL), to build simple message queues (LPUSH and BRPOP), to run publish/subscribe (PUB/SUB) systems, and so on. Larger internet companies typically have a dedicated team that provides Redis storage as a basic service for the individual business teams to call.
However, the provider of any underlying service will sooner or later be asked by its callers: is your service highly available? I would rather my business not suffer because your service keeps having problems. I recently set up a small "high-availability" Redis service for a project, and this post is a summary of that work and the thinking behind it.
The first thing to define is what high availability means for a Redis service: that it can keep serving requests under various abnormal conditions, or, more loosely, that it can recover to normal service within a short time after an anomaly. The anomalies should include at least the following possibilities:
"Exception 1" a process of a node server suddenly down (for example, a development hand residue, the redis-server process of a server kill)
"Exception 2" a node server down, equivalent to all the processes on this node is stopped (for example, a maintenance hand, the power of a server to unplug; For example, some older machines have hardware failures)
"Exception 3" any two node communication between the server is interrupted (for example, a temporary worker, the cable used for communication between the two rooms has been cut off)
Any one of these anomalies is a low-probability event on its own, and the basic guideline for high availability is: the probability of several low-probability events happening at the same time is negligible. As long as we design the system to tolerate a single point of failure within a short period of time, we can achieve high availability.
There are a number of solutions for a highly available Redis service available online, such as Keepalived, Codis, Twemproxy, and Redis Sentinel. Among them, Codis and Twemproxy are mainly aimed at large-scale Redis clusters; they are open-source solutions offered by Wandoujia and Twitter before Redis officially launched Redis Sentinel. The data volume of my business is not large, so a clustered service would be a waste of machines. The final choice was therefore between Keepalived and Redis Sentinel, and I went with the official solution, Redis Sentinel.
Redis Sentinel can be understood as a process that monitors whether a Redis server instance is healthy; once it detects that the master is down, it automatically promotes a backup (slave) Redis server, so that users of the Redis service never notice the exception that happened inside it. Let us build a minimal high-availability Redis service, going from simple to complex.
Scenario 1: Standalone Redis Server, no Sentinel
For a personal site, or for everyday development, we usually run a single instance of Redis server. The caller connects directly to the Redis service; the client and Redis may even sit on the same server. This arrangement is only suitable for personal learning and tinkering, because it has an unsolvable single point of failure: once the Redis process dies or server 1 goes down, the service is unavailable, and if Redis persistence is not configured, the data stored in Redis is lost as well.
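As a reference point for the later scenarios, this direct-connection mode takes nothing more than an address in the client. A minimal sketch with go-redis/redis, one of the client libraries mentioned below (the address is a placeholder):

```go
package main

import (
	"fmt"

	"github.com/go-redis/redis"
)

func main() {
	// Direct connection to the single redis-server; the address is a placeholder.
	client := redis.NewClient(&redis.Options{
		Addr: "192.168.1.1:6379",
	})

	// If that one process or machine dies, every call starts failing:
	// there is no backup for the client to fall back to.
	if err := client.Set("greeting", "hello", 0).Err(); err != nil {
		panic(err)
	}
	val, err := client.Get("greeting").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println(val) // hello
}
```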
Scenario 2: Master-slave synchronized Redis servers, single Sentinel instance
To achieve high availability and solve the single point of failure described in scenario 1, we must add a backup service: start a Redis server process on each of two servers, where in general one serves as master and the slave only synchronizes and backs up data. At the same time, start an extra Sentinel process to monitor the availability of the two Redis server instances, so that when the master dies, the slave can be promoted to master and keep serving; this gives the Redis server high availability. This follows the design premise of a highly available service: a single point of failure is itself a low-probability event, while multiple simultaneous single-point failures (that is, master and slave dying at the same time) can be treated as a (basically) impossible event.
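A minimal Sentinel configuration for this topology might look like the sketch below; the master name mymaster, the addresses, and the timeouts are all illustrative:

```
# sentinel.conf -- a minimal sketch; "mymaster" and the addresses are illustrative
port 26379
# Watch the master at 192.168.1.1:6379. The trailing 1 is the quorum:
# how many Sentinels must agree that the master is down.
sentinel monitor mymaster 192.168.1.1 6379 1
# Consider the master down after 5 seconds without a valid reply.
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
```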
For the caller, the Redis service now means connecting to Redis Sentinel rather than to a Redis server directly. The common invocation flow is: the client connects to Redis Sentinel, asks which Redis server is currently the master and which are slaves, and then connects to the appropriate Redis server for its operations. Current third-party libraries have generally implemented this flow already, so we no longer need to do it by hand (for example ioredis for Node.js, Predis for PHP, go-redis/redis for Golang, Jedis for Java, and so on).
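Here is a sketch of that flow using go-redis/redis from the list above: the failover client asks Sentinel for the current master and re-resolves it after a switch. The master name and Sentinel address are placeholders:

```go
package main

import (
	"fmt"

	"github.com/go-redis/redis"
)

func main() {
	// The failover client first asks Sentinel which instance is currently
	// the master, then connects to it; after a failover it re-resolves the
	// master automatically. Master name and Sentinel address are placeholders.
	client := redis.NewFailoverClient(&redis.FailoverOptions{
		MasterName:    "mymaster",
		SentinelAddrs: []string{"192.168.1.1:26379"},
	})

	if err := client.Set("greeting", "hello", 0).Err(); err != nil {
		panic(err)
	}
	fmt.Println(client.Get("greeting").Val()) // hello
}
```

When more Sentinels are added in the later scenarios, the client side only needs more entries in SentinelAddrs.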
However, having achieved master-slave switching of the Redis server, we have introduced a new problem: Redis Sentinel itself is a single point of service. Once the Sentinel process dies, the client has no way to reach Sentinel at all. Therefore, the configuration of scenario 2 still does not achieve high availability.
Scenario 3: Master-slave synchronized Redis servers, two Sentinel instances
To solve the problem of scenario 2, we start one more Redis Sentinel process, so that two Sentinel processes provide service discovery for clients. The client can connect to either Redis Sentinel service to get the basic information about the current Redis server instances. In general, we configure multiple Redis Sentinel addresses on the client side, and once the client finds one address unreachable, it tries the other Sentinel instances; this, too, needs no manual implementation, as the popular Redis connection libraries in every development language have already built it in. The expectation is that even if one Redis Sentinel dies, the other can still provide service.
The vision is beautiful, but the reality is cruel: with this architecture it is still impossible to make the Redis service highly available. In the scenario 3 diagram, the red line is the communication between the two servers, and the anomaly to imagine is "Exception 2": one server goes down. Suppose server 1 is down; only the Redis Sentinel and the slave Redis server processes on server 2 remain. At this point, Sentinel will not actually promote the remaining slave to master, so the Redis service becomes unavailable: Redis only performs the master-slave switch when more than 50% of the Sentinel processes can connect to each other and vote for a new master. In this case only one of the two Sentinels is reachable, which is exactly 50%, not a majority, so no switch happens.
You might ask: why does Redis have this 50% rule? Suppose we allowed Sentinel to perform the master-slave switch when only 50% or fewer of the Sentinels are reachable, and imagine "Exception 3": the network between server 1 and server 2 is interrupted, but both servers themselves are still running.
From server 2's point of view, server 1 going down and the network to server 1 being cut look exactly the same: communication suddenly becomes impossible. Suppose the network is down and we allow the Sentinel on server 2 to promote the slave to master: the result is two Redis servers both serving clients. Any write from a client might land on the Redis on server 1 or on the Redis on server 2 (depending on which Sentinel the client is connected to), causing data chaos. Even once the network between server 1 and server 2 recovers, we have no way to unify the data (two divergent datasets, which one should we trust?), and data consistency is completely destroyed.
Scenario 4: Master-slave synchronized Redis servers, three Sentinel instances
Since scenario 3 cannot achieve high availability, our final version is the one shown in scenario 4, which is the structure we eventually built. We introduced server 3 and ran a Redis Sentinel process on it, so three Sentinel processes now manage the two Redis server instances. In this scenario, whether it is a single process failure, a single machine failure, or a network partition between two machines, the Redis service can keep serving the outside world.
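Assuming the same illustrative addresses as before, all three Sentinels can share one configuration; note that the quorum is now 2, so at least two Sentinels must agree the master is down before a failover is triggered:

```
# sentinel.conf on servers 1, 2 and 3 -- illustrative values
port 26379
# Quorum of 2: at least two of the three Sentinels must agree that the
# master is down, which is what makes the majority vote described above work.
sentinel monitor mymaster 192.168.1.1 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
```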
In fact, if you have a machine to spare, you can also run a Redis server on server 3, forming a 1-master, 2-slave architecture: every piece of data then has two backups and availability improves further. Of course, more slaves are not always better; master-slave synchronization has a time cost.
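For completeness, turning server 3 into a second slave takes one line in its redis.conf (the master address is illustrative; newer Redis versions also accept replicaof):

```
# redis.conf on each slave -- the master's address is illustrative
slaveof 192.168.1.1 6379
```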
In scenario 4, once communication between server 1 and the other servers is completely interrupted, servers 2 and 3 will promote the slave to master. For clients there will be two masters for a while, and once the network recovers, all the new data that landed on server 1 during the interruption is lost. To partially mitigate this, you can configure the Redis server process to stop accepting writes as soon as it detects that its network is in trouble, so that no new data arrives during the failure (see Redis's min-slaves-to-write and min-slaves-max-lag configuration items).
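A sketch of that mitigation in the master's redis.conf; the thresholds are illustrative and should be tuned to your actual replication lag:

```
# redis.conf on the master -- illustrative thresholds
# Refuse writes when fewer than 1 slave is connected...
min-slaves-to-write 1
# ...or when the connected slaves lag more than 10 seconds behind.
min-slaves-max-lag 10
```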
So far, we have built a highly available Redis service with three machines. There are even more machine-frugal approaches on the web, such as placing one Sentinel process on the client's machine rather than the service provider's. In a company, however, the providers and callers of a service generally do not come from the same team, and two teams operating the same machine easily leads to mis-operations caused by communication gaps, so we adopted the scenario 4 architecture for exactly this human factor. And since server 3 runs only a Sentinel process, which consumes few resources, that server can also be used to run other services.
Ease of use: using Redis Sentinel like standalone Redis
As service providers, we always talk about user experience, and in the scenarios above there is always something less than comfortable for the client. With standalone Redis, the client connects directly to the Redis server, and we only need to hand out one IP and port for the client to use our service. After switching to Sentinel mode, the client has to adopt an external dependency that supports Sentinel mode and modify its Redis connection configuration, which "sentimental" users obviously cannot accept. Is there a way, as with standalone Redis, to give the client a fixed IP and port and be done with it?
The answer, of course, is yes: introduce a virtual IP (VIP). We point the virtual IP at the server where the Redis master lives; when a master-slave switch occurs, a callback script is triggered that moves the VIP over to the server where the slave (now the new master) resides. To the client, it looks exactly like a standalone, highly available Redis service.
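Sentinel supports this with its client-reconfig-script hook, which it invokes after every failover with the old and new master's addresses among its arguments. A sketch, with a placeholder script path:

```
# sentinel.conf -- script path and master name are placeholders
# After a failover, Sentinel runs the script with the arguments
# <master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>,
# so the script can move the VIP from <from-ip>'s host to <to-ip>'s host.
sentinel client-reconfig-script mymaster /opt/redis/failover-vip.sh
```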
Conclusion
Building a service that merely "works" is very simple, just like running a standalone Redis. But the moment "high availability" is required, things become complicated. The business uses two extra servers, 3 Sentinel processes plus 1 slave process, just to ensure that the service is still available when a low-probability event strikes. In the actual business we also enabled supervisor for process monitoring, so that a process that exits unexpectedly is automatically restarted.