ZooKeeper is not designed for high availability.
- Many systems must be deployed across data centers for disaster tolerance. For cost-effectiveness, we usually let multiple data centers serve traffic at the same time instead of keeping n-fold redundancy; that is, a single data center cannot carry the full traffic (imagine Google running on only one data center worldwide). Because a ZooKeeper cluster can have only one master, once the link between data centers fails, only the partition that holds the master keeps working. Business modules running in the other data centers stop because they cannot reach a master, all traffic concentrates in the data center that has it, and the system collapses.
- Even within a single data center, networks span multiple segments, and segments occasionally get isolated when the data center's switches are adjusted. In practice, subnet isolation of this kind happens roughly every month, and ZooKeeper is unavailable while it lasts. If the entire business system depends on ZooKeeper (for example, every business request queries ZooKeeper for the business system's master address), the system's availability becomes very fragile.
- Because ZooKeeper is extremely sensitive to network isolation, it reacts drastically to any traffic-flooding attack on the network, which can leave ZooKeeper unavailable for a long time. We cannot afford to let ZooKeeper's unavailability make the whole system unavailable.
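The cross-data-center failure above comes down to ZooKeeper's majority requirement. A minimal sketch in plain Python (no ZooKeeper involved; the 5-node, two-data-center split is an illustrative assumption):

```python
def has_quorum(reachable, ensemble_size):
    """A partition can elect or keep a leader only if it can reach
    a strict majority of the ensemble."""
    return reachable > ensemble_size // 2

# a 5-node ensemble split 3 + 2 across two data centers
print(has_quorum(3, 5))  # True: the majority side keeps its leader
print(has_quorum(2, 5))  # False: the minority data center has no leader
```

This is why, after a partition, at most one data center can keep a working master: no matter how the nodes are split, only one side can hold a strict majority.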
The election process of ZooKeeper is slow.
- This is a weakness that is hard to see in theoretical analysis, but once you run into it, you will never forget it.
- As we have already said, the network is often in partially failed states such as isolation, and ZooKeeper is very sensitive to them: once network isolation occurs, ZooKeeper initiates an election.
- A ZooKeeper election usually takes 30 to 120 seconds, during which ZooKeeper is unavailable because it has no master.
- For a transient network fault, such as half a second of isolation, the election process multiplies the unavailable time by a factor of dozens or more.
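The amplification can be made concrete with back-of-the-envelope arithmetic (the half-second figure is the example from above, not a measurement):

```python
# how much a brief isolation is amplified by the election process
isolation = 0.5        # seconds the network was actually isolated (example)
election = (30, 120)   # typical leader-election duration range, per above

amplification = tuple(e / isolation for e in election)
print(amplification)   # (60.0, 240.0): the blip costs 60x-240x its own length
```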
ZooKeeper has limited performance.
- Typical ZooKeeper throughput is a bit over 10,000 TPS, which cannot cover the billions of calls the system makes every day. It is therefore impossible to fetch the business system's master information from ZooKeeper on every request.
- Therefore, the ZooKeeper client must cache the business system's master address.
- As a result, the 'strong consistency' that ZooKeeper provides is effectively unavailable to callers. If we need strong consistency, we need additional mechanisms to guarantee it: for example, an automated script that kills the business system's old master, which has many traps (this topic is not discussed here; readers can think through the traps on their own).
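The throughput gap is easy to check. A quick calculation, using one billion calls per day as a conservative lower bound for "billions":

```python
# why per-request master lookups cannot go through ZooKeeper
DAILY_CALLS = 1_000_000_000   # "billions of calls per day", lower bound
ZK_TPS = 10_000               # typical ZooKeeper throughput, per above

required_tps = DAILY_CALLS / 86_400   # 86,400 seconds per day
print(round(required_tps))            # 11574: already above 10k TPS
```

Even a perfectly even one billion calls per day needs more than 10,000 lookups per second, so client-side caching is forced, and with it the staleness problems discussed below.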
ZooKeeper cannot provide effective permission control.
- ZooKeeper's built-in permission control is weak.
- In a large, complex system, an additional permission-control system must be developed on top of ZooKeeper, and all access to ZooKeeper then goes through that system.
- This additional permission-control system not only increases system complexity and maintenance cost but also reduces overall system performance.
Even with ZooKeeper, it is difficult to avoid data inconsistency in the business system.
- As we have discussed earlier, ZooKeeper's performance limits mean we cannot route every internal call through it, so there will always be windows in which the business system has two masters: clients cache the master address and refresh it from ZooKeeper only periodically, so the refreshes are never all synchronized.
- If you want to keep the system's data consistent while clients hold inconsistent master information, the only method is to kill the old master first and only then update the master information in ZooKeeper. However, a program cannot decide on its own whether to kill the current master: during network isolation ZooKeeper itself is unavailable, and an automated script has no global view, so whatever it does may be wrong. When the network fails, only the operations staff have the global picture, and a program cannot pick up the phone to learn what is happening in the other data centers. The system therefore cannot automatically guarantee data consistency; manual intervention is required, and manual intervention typically takes more than half an hour. We cannot keep the system unavailable that long, so we must compromise somewhere. The most common compromise is to give up strong consistency and accept eventual consistency.
- And if manual intervention is needed to guarantee 'high reliability and consistency', the value of ZooKeeper is greatly reduced.
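The two-masters window from periodic cache refresh can be simulated in a few lines. This is a toy model, not ZooKeeper client code; the 30-second refresh interval, host names, and `registry` dict are all illustrative assumptions:

```python
REFRESH_INTERVAL = 30  # seconds between cache refreshes (assumed)

class Client:
    """A business-system client that caches the master address and
    refreshes it from the registry only periodically."""
    def __init__(self, registry, now):
        self.registry = registry
        self.cached_master = registry["master"]
        self.next_refresh = now + REFRESH_INTERVAL

    def master(self, now):
        if now >= self.next_refresh:
            self.cached_master = self.registry["master"]
            self.next_refresh = now + REFRESH_INTERVAL
        return self.cached_master

registry = {"master": "host-A"}   # stands in for the znode content
c1 = Client(registry, now=0)      # will refresh at t=30, 60, ...
c2 = Client(registry, now=15)     # will refresh at t=45, 75, ...

registry["master"] = "host-B"     # failover happens at t=30

masters = {c1.master(40), c2.master(40)}
print(sorted(masters))            # ['host-A', 'host-B']: two live masters
```

Until every client's cache expires, some requests still go to the old master, which is exactly why the text insists the old master must be killed before the address is updated.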
What can we do?
- We can choose either strong consistency with manual intervention or weak consistency with full program automation. You need to make a trade-off.
- Eventual consistency does not even have to be implemented by programs. Sometimes manual data correction is flexible, reliable, and cost-effective; this too requires balance.
- Do not be superstitious about ZooKeeper. Sometimes a primary/standby database is worth considering: a database comes with permission control and is much easier to use than ZooKeeper.
- Where ZooKeeper may genuinely help is in notifying all online clients in real time, via watch callbacks, when content changes. But a PHP module is hard to classify as online or offline: a new process is started for every request, so it cannot hold a long-lived watch. Once this feature cannot cover PHP, it cannot cover the entire system, and therefore cannot guarantee strong consistency.
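The online/offline distinction above can be illustrated with a toy stand-in for a watched znode (this is not the real ZooKeeper client API; like ZooKeeper's watches, each registration here fires once and must then be renewed):

```python
class WatchableNode:
    """Toy stand-in for a znode with one-shot watches: each watch fires
    once on the next change and must then be re-registered."""
    def __init__(self, data):
        self.data = data
        self._watchers = []

    def watch(self, callback):
        self._watchers.append(callback)

    def set(self, data):
        self.data = data
        watchers, self._watchers = self._watchers, []  # one-shot semantics
        for cb in watchers:
            cb(data)

node = WatchableNode("host-A")
seen = []
node.watch(seen.append)   # a long-lived (online) client registers a watch
node.set("host-B")        # the change is pushed to it immediately
print(seen)               # ['host-B']
# a per-request process (e.g. PHP) starts after the change and never had a
# watch registered; it can only re-read node.data on its next request
```

Push-based notification only reaches processes that were alive and watching before the change; short-lived request handlers always fall back to polling, which is the coverage gap the bullet describes.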
PS: This reflects my personal understanding of ZooKeeper. Discussion and corrections are welcome.