Analysis and construction of a highly available Redis service architecture
In-memory Redis is probably the most widely used key-value database in web development. We often use it to store user login state (session storage), to speed up queries on hot data (an order of magnitude faster than MySQL), and to build simple message queues (LPUSH and BRPOP), publish/subscribe (Pub/Sub) systems, and more. Larger internet companies typically have a dedicated team that provides Redis storage as an underlying service for individual business teams to call.
But one question every provider of an underlying service gets asked by its callers is whether the service is highly available: please don't let problems in your service make my business suffer. Recently I built a small "highly available" Redis service for my own project, and this article is my summary of that work and my thoughts on it.
First we need to define what high availability means for a Redis service: it can still provide normal service under all kinds of abnormal conditions. Or, more loosely, when something goes wrong, normal service is restored after only a very short time. The exceptions considered should include at least the following:
"Exception 1": a process on one node server suddenly goes down (for example, a clumsy developer kills the redis-server process on one server).
"Exception 2": a node server goes down, which means every process on that node stops (for example, a clumsy operations engineer unplugs a server's power cable, or the hardware of an old machine fails).
"Exception 3": communication between any two node servers is interrupted (for example, a clumsy temporary worker digs up the cable that connects two machine rooms).
Each of these exceptions is a low-probability event, and the basic guiding principle of high availability is that the probability of several low-probability events happening at the same time is negligible. As long as the system we design can tolerate a single point of failure for a short time, it achieves high availability.
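To put rough numbers on this principle, here is a minimal sketch. It assumes that nodes fail independently and with the same probability during a given time window; the 1% figure is purely illustrative and does not come from any measurement.

```python
# Hypothetical figure: chance that one node is down during a given window.
P_SINGLE = 0.01


def all_down_probability(p: float, nodes: int) -> float:
    """Probability that `nodes` independent nodes are all down at once."""
    return p ** nodes


p_two = all_down_probability(P_SINGLE, 2)
p_three = all_down_probability(P_SINGLE, 3)

print(f"one node down:    {P_SINGLE:.6f}")
print(f"two nodes down:   {p_two:.6f}")
print(f"three nodes down: {p_three:.6f}")
```

Under these assumptions, every extra independent node multiplies the chance of a total outage by the single-node failure probability, which is why tolerating one failure at a time is usually enough.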
There are many solutions on the web for building a highly available Redis service, such as Keepalived, Codis, Twemproxy, and Redis Sentinel. Codis and Twemproxy are mainly used for large-scale Redis clusters; they are open-source solutions released by Wandoujia and Twitter, respectively, before Redis officially released Redis Sentinel. The amount of data in my business is not large, so running a cluster would be a waste of machines. In the end I chose between Keepalived and Redis Sentinel, and went with the official solution, Redis Sentinel.
Redis Sentinel can be understood as a process that monitors whether the Redis server service is healthy and, once it detects an anomaly, automatically promotes the backup (slave) Redis server, so that external users are unaware of the exception inside the Redis service. Below we build a small, highly available Redis service step by step, from simple to complex.
Scenario 1: Standalone Redis Server, no Sentinel
Normally, when we build a personal website or do everyday development, we run a single Redis server instance. The caller connects directly to the Redis service, and the client and Redis may even be on the same server. This setup is only suitable for personal learning and entertainment, because it always has a single point of failure that cannot be solved. Once the Redis process dies or server 1 goes down, the service is unavailable. And if Redis persistence is not configured, the data stored in Redis is lost as well.
Scenario 2: Master-slave Redis Server, single Sentinel instance
To achieve high availability and solve the single point of failure described in Scenario 1, we must add a backup service: start one Redis server process on each of two servers, with the master normally serving requests and the slave only synchronizing and backing up data. We also start a Sentinel process to monitor the availability of the two Redis server instances, so that when the master goes down, the slave is promoted to master and continues to serve. This makes the Redis server highly available. It follows the design principle of highly available services: a single point of failure is itself a low-probability event, and multiple simultaneous failures (master and slave down at the same time) can be treated as (practically) impossible.
Callers of the Redis service now connect to the Redis Sentinel service rather than directly to the Redis server. The usual procedure is that the client first connects to Redis Sentinel and asks which Redis server is the master and which are slaves, and then connects to the appropriate Redis server to operate on it. Third-party libraries generally implement this procedure already, so we no longer need to write it by hand (for example ioredis for Node.js, Predis for PHP, go-redis/redis for Go, and Jedis for Java).
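For readers curious about what those libraries do under the hood, the lookup can be sketched with nothing but a socket. The sketch below sends Sentinel's SENTINEL get-master-addr-by-name command in the Redis wire protocol (RESP) and parses the two-element reply. The master name "mymaster" and the addresses in the usage comment are assumptions for illustration, and a real client library handles errors, partial reads, and reconnection far more carefully.

```python
import socket
from typing import Tuple


def encode_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        parts.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(parts)


def parse_master_addr(reply: bytes) -> Tuple[str, int]:
    """Parse the (ip, port) array reply of get-master-addr-by-name."""
    lines = reply.split(b"\r\n")
    if lines[0] != b"*2":
        # For example a null reply when the master name is unknown.
        raise ValueError(f"unexpected reply: {reply!r}")
    # Layout: *2, $<len>, <ip>, $<len>, <port>, ''
    return lines[2].decode(), int(lines[4])


def ask_sentinel(host: str, port: int, master_name: str,
                 timeout: float = 1.0) -> Tuple[str, int]:
    """Ask one Sentinel for the current master address."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(encode_command(
            "SENTINEL", "get-master-addr-by-name", master_name))
        # Simplification: assume the short reply arrives in a single read.
        return parse_master_addr(conn.recv(4096))


# Usage against a running Sentinel (hypothetical address and master name):
#   master_ip, master_port = ask_sentinel("10.0.0.1", 26379, "mymaster")
```

Once the master address is known, the client opens an ordinary Redis connection to it and repeats the lookup whenever that connection breaks.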
However, although we now have master-slave switching for the Redis server, we have introduced a new problem: Redis Sentinel itself is a single point. Once the Sentinel process dies, clients can no longer connect to Sentinel. So the configuration in Scenario 2 does not achieve high availability.
Scenario 3: Master-slave Redis Server, two Sentinel instances
To solve the problem of Scenario 2, we start a second Redis Sentinel process, so that two Sentinel processes provide service discovery for clients. A client can connect to either Redis Sentinel to get basic information about the current Redis server instances. Normally we configure several Redis Sentinel addresses on the client; if the client finds that one address cannot be reached, it tries the other Sentinel instances. This does not need to be implemented by hand either, since the popular Redis client libraries in each language already do it. Our expectation is that even if one Redis Sentinel goes down, the other Sentinel can still provide service.
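The "try the next Sentinel" behaviour that client libraries provide can be sketched as follows. This is an illustration, not any particular library's implementation: find_master, fake_query, and all addresses are hypothetical, and the query function is passed in so the example runs without a real Sentinel.

```python
from typing import Callable, Iterable, Tuple

Address = Tuple[str, int]


def find_master(sentinels: Iterable[Address],
                query: Callable[[Address], Address]) -> Address:
    """Ask each Sentinel in turn and return the first master address found."""
    errors = []
    for sentinel in sentinels:
        try:
            return query(sentinel)
        except OSError as exc:  # refused, timed out, unreachable, ...
            errors.append((sentinel, exc))
    raise ConnectionError(f"no Sentinel reachable: {errors}")


def fake_query(sentinel: Address) -> Address:
    """Stand-in for a real Sentinel lookup: the Sentinel on server 1 is down."""
    if sentinel == ("10.0.0.1", 26379):
        raise ConnectionRefusedError("Sentinel on server 1 is down")
    return ("10.0.0.2", 6379)


SENTINELS = [("10.0.0.1", 26379), ("10.0.0.2", 26379)]
master = find_master(SENTINELS, fake_query)
print(master)  # ('10.0.0.2', 6379)
```

The client only needs one reachable Sentinel to discover the master, which is what makes a list of Sentinel addresses look sufficient at first glance.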
However, the vision is beautiful and the reality is cruel. This architecture still fails to make the Redis service highly available. In the Scenario 3 diagram, the red line is the communication between the two servers. The exception we consider ("Exception 2") is a whole server going down. Suppose server 1 goes down; only the Redis Sentinel and the slave Redis server process on server 2 remain. Sentinel will not promote the one remaining slave to master, so the Redis service becomes unavailable. The reason is that Redis only performs a master-slave switch when more than 50% of the Sentinel processes can connect to each other and vote to elect a new master. In this example only one of the two Sentinels is reachable, which is exactly 50%, so the condition for a master-slave switch is not met.
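The "more than 50%" rule is easy to state as code. The function below is a minimal sketch of the rule as described here, not Sentinel's actual implementation; the name can_fail_over is made up for this example.

```python
def can_fail_over(reachable_sentinels: int, total_sentinels: int) -> bool:
    """True only if strictly more than half of all Sentinels can vote."""
    return reachable_sentinels * 2 > total_sentinels


# Two Sentinels, server 1 is down: only 1 of 2 is left, exactly 50%.
print(can_fail_over(1, 2))  # False

# For comparison, with three Sentinels one can be lost: 2 of 3 remain.
print(can_fail_over(2, 3))  # True
```

With an even number of Sentinels, losing half of them always leaves the survivors one vote short, so two Sentinels tolerate no Sentinel failure at all.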
You may ask why Redis has this 50% rule. Suppose we allowed a master-slave switch when 50% or fewer of the Sentinels are reachable. Consider "Exception 3": the network between server 1 and server 2 is interrupted, but both servers keep running, as shown in the following illustration:
From server 2's point of view, server 1 going down and the network to server 1 being cut look exactly the same: in both cases all communication suddenly stops. Suppose we let the Sentinel on server 2 promote the slave to master when the network is interrupted. The result is two Redis servers that both accept requests. Any write a client performs (insert, delete, update) may land on the Redis on server 1 or on the Redis on server 2, depending on which Sentinel the client is connected to, and the data becomes inconsistent. Even after the network between server 1 and server 2 recovers, we cannot merge the data (there are two different copies; which one do we trust?), so data consistency is completely destroyed.
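The split-brain argument can be checked with a small simulation. It is a sketch under simplifying assumptions: every side of the partition has a slave that could be promoted, the old master keeps running on its own side, and the function name and side labels are invented for this example.

```python
from typing import Dict, List


def masters_after_partition(sides: Dict[str, int], old_master_side: str,
                            strict_majority: bool) -> List[str]:
    """Return the sides that host a master after the network splits.

    `sides` maps each side of the partition to the number of Sentinels on it.
    """
    total = sum(sides.values())
    masters = [old_master_side]  # the old master never stopped running
    for side, sentinels in sides.items():
        if side == old_master_side:
            continue
        if strict_majority:
            allowed = sentinels * 2 > total   # the real "more than 50%" rule
        else:
            allowed = sentinels > 0           # hypothetical permissive rule
        if allowed:
            masters.append(side)
    return masters


two_servers = {"server 1": 1, "server 2": 1}

# Permissive rule: both sides end up with a master (split brain).
print(masters_after_partition(two_servers, "server 1", strict_majority=False))
# ['server 1', 'server 2']

# Majority rule: server 2 may not promote its slave, so one master remains.
print(masters_after_partition(two_servers, "server 1", strict_majority=True))
# ['server 1']
```

The majority rule trades availability for consistency: with two Sentinels it prevents the double master, at the price of never failing over when a server is lost.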
Scenario 4: Master-slave Redis Server, three Sentinel instances
Since Scenario 3 cannot achieve high availability, our final version is Scenario 4, shown in the figure above. This is the architecture we actually built. We introduce server 3 and run a Redis Sentinel process on it, so three Sentinel processes now manage two Redis server instances. In this scenario, the Redis service keeps running through a single process failure, a single machine failure, or a network failure between any two machines.
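As an illustration of what each of the three Sentinels might be configured with, the sketch below renders a minimal sentinel.conf. The directives are standard Sentinel configuration directives, but the master name, addresses, and timing values are assumptions for this example and not the settings used in the setup described here.

```python
def render_sentinel_conf(master_name: str, master_ip: str, master_port: int,
                         quorum: int, down_after_ms: int = 5000,
                         failover_timeout_ms: int = 60000) -> str:
    """Render a minimal sentinel.conf for one Sentinel process."""
    lines = [
        "port 26379",
        # quorum: how many Sentinels must agree that the master is down.
        f"sentinel monitor {master_name} {master_ip} {master_port} {quorum}",
        f"sentinel down-after-milliseconds {master_name} {down_after_ms}",
        f"sentinel failover-timeout {master_name} {failover_timeout_ms}",
        f"sentinel parallel-syncs {master_name} 1",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical values: master on server 1, quorum of 2 out of 3 Sentinels.
conf = render_sentinel_conf("mymaster", "10.0.0.1", 6379, quorum=2)
print(conf)
```

The same file would be placed on servers 1, 2, and 3. With a quorum of 2, any two Sentinels that can still see each other can agree that the master is down, and two out of three is also the majority needed to carry out the switch.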
If you have spare machines, you can also run a Redis server on server 3 to form a 1 master + 2 slaves architecture, which keeps two backups of every piece of data and improves availability somewhat. More slaves is not always better, though, because master-slave synchronization has its own cost in time.
In Scenario 4, once communication between server 1 and the other servers is completely cut, the Sentinels on servers 2 and 3 promote the slave to master. From the clients' point of view there are briefly two masters, and once the network recovers, all new data written to server 1 during the outage is lost. To partially address this, you can configure the Redis server process to stop accepting writes as soon as it detects a problem with its own network, so that no new data comes in during the network failure (see Redis's min-slaves-to-write and min-slaves-max-lag configuration items).
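The effect of these two settings can be sketched as a simple rule: the master accepts writes only while enough slaves have acknowledged it recently. The function below models that rule for illustration and is not Redis source code; the lag values in the usage lines are made up. (Since Redis 5 the same options are also available under the names min-replicas-to-write and min-replicas-max-lag.)

```python
from typing import Sequence


def master_accepts_writes(slave_lags_seconds: Sequence[float],
                          min_slaves_to_write: int,
                          min_slaves_max_lag: float) -> bool:
    """Model of the min-slaves-to-write / min-slaves-max-lag check.

    `slave_lags_seconds` holds, for each connected slave, the time since
    the master last heard from it.
    """
    if min_slaves_to_write == 0:  # 0 disables the check
        return True
    healthy = sum(1 for lag in slave_lags_seconds
                  if lag <= min_slaves_max_lag)
    return healthy >= min_slaves_to_write


# Normal operation: one slave, last heard from 1 second ago.
print(master_accepts_writes([1.0], min_slaves_to_write=1,
                            min_slaves_max_lag=10))   # True

# Server 1 is cut off: its slave has been silent for 25 seconds.
print(master_accepts_writes([25.0], min_slaves_to_write=1,
                            min_slaves_max_lag=10))   # False
```

This only narrows the window: writes that arrive on the isolated master before the lag limit is exceeded can still be lost.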
At this point we have a highly available Redis service built from 3 machines. There is also a more machine-saving approach described online: put a Sentinel process on the client machine rather than on the service provider's machine. Within a company, however, the provider and the caller of a service are usually not on the same team. When two teams operate the same machine, communication problems easily lead to operational mistakes, so because of this human factor we adopted the Scenario 4 architecture. And since server 3 runs only a Sentinel process and uses few resources, server 3 can also be used to run some other services.
Ease of use: using Redis Sentinel like a standalone Redis
As a service provider, we always have to talk about user experience. The scenarios above still leave the client side somewhat uncomfortable. With standalone Redis, the client connects directly to the Redis server; we only give it an IP and port, and the client can use our service. After switching to Sentinel mode, the client has to adopt an external dependency that supports Sentinel and also change its Redis connection configuration, which is clearly unacceptable to "picky" users. Is there a way to provide the service as with standalone Redis, by giving the client only a fixed IP and port?
The answer is yes. This requires introducing a virtual IP (VIP), as shown in the previous illustration. We point the virtual IP at the server where the Redis master runs. When a Redis master-slave switch happens, a callback script is triggered, and the script moves the VIP to the server where the former slave (now the new master) runs. To the client it then looks like a standalone Redis that happens to be highly available.
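One way to wire up such a callback is Sentinel's client-reconfig-script option, which runs a script with the arguments master-name, role, state, from-ip, from-port, to-ip, to-port. The sketch below only builds the ip command that such a script might run on each server; the VIP, the interface name, and the local address are hypothetical, and this is not necessarily the script used in the setup described here.

```python
from typing import List

VIP = "10.0.0.100/24"  # hypothetical virtual IP
INTERFACE = "eth0"     # hypothetical network interface
LOCAL_IP = "10.0.0.2"  # hypothetical address of the server running the script


def vip_command(argv: List[str], local_ip: str = LOCAL_IP) -> List[str]:
    """Build the `ip` command for one failover notification.

    `argv` holds the seven arguments Sentinel passes to the script:
    master-name, role, state, from-ip, from-port, to-ip, to-port.
    """
    master_name, role, state, from_ip, from_port, to_ip, to_port = argv
    # Claim the VIP if this server hosts the new master, otherwise release it.
    action = "add" if to_ip == local_ip else "del"
    return ["ip", "addr", action, VIP, "dev", INTERFACE]


# Example notification: the master moved from 10.0.0.1 to 10.0.0.2.
args = ["mymaster", "leader", "failover",
        "10.0.0.1", "6379", "10.0.0.2", "6379"]
print(" ".join(vip_command(args)))
# ip addr add 10.0.0.100/24 dev eth0
```

A real script would execute the command (for example with subprocess.run) and would also need to cope with the case where the old master's server is unreachable and cannot release the VIP itself.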
Conclusion
Building any service so that it merely "works" is actually simple, just like running a standalone Redis. But once you aim for "high availability", things get complicated. The business uses two extra servers, plus 3 Sentinel processes and 1 slave process, just to keep the service available when a low-probability accident happens. In production we also use supervisor for process monitoring, so that a process that exits unexpectedly is automatically restarted.