Meituan-Dianping on the evolution and outlook of its database high-availability architecture



Jinlong · 2017-06-29 20:11

This article describes how the high-availability architecture of Meituan-Dianping's MySQL databases has evolved in recent years, along with some of the improvements we have built on top of open-source technology. It also compares our approach with other solutions in the industry, to show where the industry stands on high availability, and closes with our future plans and outlook.

MMM

Before 2015, Meituan-Dianping (on the Dianping side) had long used MMM (Master-Master Replication Manager for MySQL) for database high availability. We accumulated a great deal of experience with it and also stepped into quite a few pitfalls; it is fair to say that MMM played a major role during the rapid growth of the company's databases.

The structure of MMM is as follows.

As shown above, the entire MySQL cluster provides one write VIP (Virtual IP) and N (N >= 1) read VIPs for external services. Each MySQL node has an mmm-agent deployed on it; the mmm-agent maintains a communication channel with the mmm-manager and periodically reports the node's liveness (the heartbeat) to the mmm-manager. When the mmm-manager fails to receive heartbeat messages from an mmm-agent several times in a row, it performs a switchover.
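A rough illustration of this heartbeat logic is sketched below. This is only a hedged approximation of the behaviour described above, not mmm-manager's actual code; the agent port, check interval, and miss threshold are illustrative assumptions.

```python
import socket
import time

# Illustrative values only; real mmm-manager settings differ.
AGENT_PORT = 9989          # assumed mmm-agent listen port
CHECK_INTERVAL = 5         # seconds between heartbeat checks
MAX_MISSED = 3             # consecutive misses before a switchover is triggered

def agent_alive(host, port=AGENT_PORT, timeout=2):
    """Return True if the agent on `host` accepts a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def monitor(nodes):
    missed = {n: 0 for n in nodes}
    while True:
        for node in nodes:
            if agent_alive(node):
                missed[node] = 0
            else:
                missed[node] += 1
                if missed[node] >= MAX_MISSED:
                    print(f"{node}: {MAX_MISSED} heartbeats missed, trigger switchover")
                    missed[node] = 0  # placeholder for the real failover routine
        time.sleep(CHECK_INTERVAL)
```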

The mmm-manager handles exceptions in two cases.

    1. The exception occurs on a slave node

      • mmm-manager attempts to remove the read VIP from that node and drift it to another surviving node, achieving high availability for the slaves in this way.
    2. The exception occurs on the master node (a rough sketch of these steps follows this list)

      • If the node is not completely down but merely timing out, mmm-manager first tries to put a global lock (flush tables with read lock) on the dead master.
      • It then selects a candidate master from the slaves, catches it up on the missing data, and makes it the new master.
      • Once the data has been caught up, it removes the write VIP from the dead master and attempts to add it to the new master.
      • The other surviving slaves are also caught up on data and re-attached to the new master.
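The master-failure steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not mmm-manager's implementation; the connection credentials, the `catch_up`/`repoint` helpers, and the VIP commands are placeholders.

```python
import subprocess
import pymysql  # any MySQL client library would do

def catch_up(node, source):
    """Placeholder: apply the binlog events `node` is missing relative to `source`."""

def repoint(slave, new_master):
    """Placeholder: CHANGE MASTER TO the new master and restart replication."""

def run(cmd):
    subprocess.run(cmd, shell=True, check=True)

def fail_over(dead_master, candidate, surviving_slaves, write_vip, iface="eth0"):
    # 1. If the dead master still responds (e.g. it is only timing out), freeze writes on it.
    try:
        conn = pymysql.connect(host=dead_master, user="mmm", password="...",
                               connect_timeout=3)
        conn.cursor().execute("FLUSH TABLES WITH READ LOCK")
    except Exception:
        pass  # completely down: nothing left to lock

    # 2. Catch the candidate master up on the missing data.
    catch_up(candidate, dead_master)

    # 3. Move the write VIP from the dead master to the new master.
    run(f"ssh {dead_master} 'ip addr del {write_vip}/32 dev {iface}'")
    run(f"ssh {candidate} 'ip addr add {write_vip}/32 dev {iface}'")

    # 4. Catch up the surviving slaves and re-attach them to the new master.
    for slave in surviving_slaves:
        catch_up(slave, candidate)
        repoint(slave, new_master=candidate)
```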

After the master fails, the state of the whole cluster changes as follows:

The mmm-manager detects that Master1 has failed; after the data has been caught up, the write VIP drifts to Master2, and the application continues its writes on the new node.

However, the MMM architecture has the following problems:

    • There are too many VIPs and they are hard to manage (we once had a cluster with 1 master and 6 slaves, for a total of 7 VIPs). In some failure scenarios most of the VIPs in the cluster are lost at the same time, and it becomes hard to tell which VIP originally belonged to which node.
    • mmm-agent is overly sensitive and can easily cause a VIP to be lost. At the same time, mmm-agent itself has no high availability; once it goes down, the mmm-manager misjudges the situation and wrongly concludes that the MySQL node is abnormal.
    • mmm-manager is a single point; once it goes down for any reason, the whole cluster loses its high availability.
    • VIPs rely on the ARP protocol, so high availability across network segments or across data centers is essentially impossible, which limits the protection MMM can offer.

At the same time, MMM is a relatively old high-availability product developed by the Google technology team; it is not widely used in the industry, the community is not active, and Google stopped maintaining the MMM code branch long ago. We found many bugs while using it; some of them we fixed ourselves and submitted to the open-source community, and interested readers can refer to them here.

MHA

With this in mind, starting in 2015 Meituan-Dianping improved its MySQL high-availability architecture and switched everything over to MHA, which solved most of the problems we had with MMM.

MHA (MySQL Master High Availability) is MySQL high-availability software developed by Facebook engineer Yoshinori Matsunobu. As the name suggests, MHA is only responsible for the high availability of the MySQL master. When the master fails, MHA selects the candidate master whose data is closest to the dead master's (if there is only one slave, that slave becomes the candidate master) as the new master, and fills in the binlog differences relative to the dead master. Once the data has been filled in, the write VIP drifts to the new master.
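As a hedged illustration of how "the candidate closest to the dead master" can be chosen, the sketch below compares the replication positions reported by SHOW SLAVE STATUS. This is a simplification of what MHA actually does, and the connection parameters are placeholders.

```python
import pymysql

def replication_position(slave_host):
    """Return how far this slave has executed the master's binlog."""
    conn = pymysql.connect(host=slave_host, user="mha", password="...",
                           cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
    # (binlog file name, executed position within that file)
    return status["Relay_Master_Log_File"], status["Exec_Master_Log_Pos"]

def pick_candidate(slaves):
    """Choose the slave that has executed the most of the dead master's binlog."""
    return max(slaves, key=replication_position)
```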

The architecture of MHA as a whole is as follows (for simplicity, only one master and one slave are shown):

Here we made some optimizations to MHA to avoid split-brain problems.

For example, if the uplink switch of a DB server jitters, the master becomes unreachable, the management node judges it as failed, an MHA switchover is triggered, and the VIP drifts to the new master. The switch then recovers and the old master becomes reachable again; but because the VIP was never removed from the old master, two machines now hold the VIP at the same time, which can produce a split brain. We therefore added probes from the MHA manager to other physical machines in the same rack, so that by comparing more information it can tell whether the problem is a network failure or a single-machine failure.
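A minimal sketch of the "probe other machines in the same rack" idea, assuming a simple ICMP ping; the host lists and thresholds are illustrative, not our manager's actual probing code.

```python
import subprocess

def reachable(host):
    """Return True if a single ping to `host` succeeds."""
    return subprocess.run(["ping", "-c", "1", "-W", "1", host],
                          stdout=subprocess.DEVNULL).returncode == 0

def diagnose(master, rack_peers):
    """Distinguish a single-host failure from a network (switch) failure."""
    if reachable(master):
        return "master reachable, no switchover needed"
    peers_up = sum(reachable(p) for p in rack_peers)
    if peers_up == 0:
        # The whole rack is unreachable: likely switch jitter, do not switch.
        return "suspected network failure, hold off on failover"
    return "single-machine failure, safe to fail over"
```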

MHA + Zebra (DAL)

Zebra is a Java database access middleware developed by the Meituan-Dianping infrastructure team. It is a dynamic data source wrapped around C3P0, and provides read/write splitting, sharding (splitting databases and tables), SQL flow control, and other powerful features. Together with MHA, it has become an important part of our MySQL high-availability setup. The overall architecture of MHA working with Zebra is as follows:

When the master fails, the logic can be handled in two ways:

    • When the MHA switchover completes, MHA actively sends a message to Zebra Monitor, and Zebra Monitor updates the configuration in ZooKeeper, marking the read traffic configured on the failed master as offline.
    • Zebra Monitor also checks the health of the nodes in the cluster at regular intervals (10s to 40s); once it finds a faulty node, it refreshes the configuration in ZooKeeper and marks that node as offline (a minimal sketch of this update follows the list).
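Below is a hedged sketch of marking a node offline in ZooKeeper using the kazoo client. The znode path and the JSON configuration format are assumptions for illustration, not Zebra's real schema.

```python
import json
from kazoo.client import KazooClient

ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"   # placeholder ensemble
CONFIG_PATH = "/zebra/clusters/demo"       # hypothetical znode, not Zebra's real path

def mark_offline(node):
    zk = KazooClient(hosts=ZK_HOSTS)
    zk.start()
    data, stat = zk.get(CONFIG_PATH)
    config = json.loads(data.decode("utf-8"))
    for member in config["members"]:
        if member["host"] == node:
            member["status"] = "offline"   # clients watching this znode will reconnect
    # Conditional update so a concurrent writer is not silently overwritten.
    zk.set(CONFIG_PATH, json.dumps(config).encode("utf-8"), version=stat.version)
    zk.stop()
```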

Once the node change is complete, the clients notice that the node has changed, immediately rebuild their connections using the new configuration, and gradually close the old connections. The failover process of the whole cluster is as follows (only the case where Zebra Monitor actively detects the failure is described; for the first case, where MHA sends the notification, please work it out yourself ^_^).

Because the switchover still relied on VIP drift, it could only happen within the same network segment (behind the same layer-2 switch); high availability across network segments or across data centers was impossible. To solve this, we made further modifications to MHA: we removed the step in which MHA adds the VIP, and instead have MHA notify Zebra Monitor after the switchover to adjust the node's read/write information (writes are pointed at the real IP of the new master, and the dead master's read traffic is removed). The whole switchover no longer uses a VIP, which allows switching across network segments and even across data centers, and completely removes the earlier restriction that high availability only worked within one network segment. The switching process then becomes as follows.
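Without the VIP, the switchover only has to tell Zebra Monitor where reads and writes should now go. A minimal sketch of such a notification is shown below, assuming a hypothetical HTTP endpoint on Zebra Monitor; the URL and payload fields are invented for illustration and are not Zebra Monitor's real API.

```python
import requests

ZEBRA_MONITOR_URL = "http://zebra-monitor.example.com/api/switch"  # hypothetical endpoint

def notify_switchover(cluster, dead_master_ip, new_master_ip):
    payload = {
        "cluster": cluster,
        "write": new_master_ip,          # point writes at the new master's real IP
        "remove_read": [dead_master_ip], # drop the dead master's read traffic
    }
    resp = requests.post(ZEBRA_MONITOR_URL, json=payload, timeout=5)
    resp.raise_for_status()
```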

However, the MHA management node is still a single point in this setup, and there is still risk in the event of a network failure or machine outage. In addition, because replication between master and slave is asynchronous and binlog-based, data can be lost during the MHA switchover when the master's machine goes down or the master becomes unreachable.

Also, when the master-slave replication lag is large, catching up the data adds extra time to the switchover.

Proxy

Besides the Zebra middleware, the company also has a proxy-based middleware that works with MHA. When MHA switches over, it actively notifies the proxy to adjust read and write traffic. The proxy is more flexible than Zebra and can also cover non-Java application scenarios. The downside is that it adds one more hop to the access path, with a corresponding increase in response time and failure rate. Interested readers can look up the detailed documentation on GitHub.

Future Architecture Vision

The MHA architecture described above still has the following two problems:

    • The management node is a single point.
    • MySQL asynchronous replication can lose data.

To address this, we use semi-synchronous replication in some of our core businesses, which ensures that data is not lost in more than 95% of scenarios (there are still some extreme cases where strong data consistency cannot be guaranteed). In addition, high availability can be implemented with distributed agents: after a node fails, a new master is chosen through an election protocol, which removes the MHA Manager single point.
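For reference, semi-synchronous replication is enabled with MySQL's standard semisync plugins. The sketch below issues the usual statements from Python; the connection details are placeholders and the timeout value is illustrative.

```python
import pymysql

def enable_semi_sync(master_host, slave_hosts):
    master = pymysql.connect(host=master_host, user="dba", password="...")
    with master.cursor() as cur:
        cur.execute("INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so'")
        cur.execute("SET GLOBAL rpl_semi_sync_master_enabled = 1")
        # How long the master waits for a slave ACK before falling back to async (ms).
        cur.execute("SET GLOBAL rpl_semi_sync_master_timeout = 1000")

    for host in slave_hosts:
        slave = pymysql.connect(host=host, user="dba", password="...")
        with slave.cursor() as cur:
            cur.execute("INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so'")
            cur.execute("SET GLOBAL rpl_semi_sync_slave_enabled = 1")
            # Restart the I/O thread so the setting takes effect.
            cur.execute("STOP SLAVE IO_THREAD")
            cur.execute("START SLAVE IO_THREAD")
```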

With these problems in mind, we studied some of the industry's leading practices, which are briefly described below.

Data loss in master-slave replication

For data loss in master-slave replication, one approach is to build a binlog server that imitates a slave and receives binlog; every time the master writes data, it must receive an ACK from the binlog server before the write is considered successful. The binlog server can be deployed on the nearest physical node, which ensures that every write lands quickly on the binlog server. In the event of a failure, the data only needs to be pulled from the binlog server, which guarantees that no data is lost.
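A binlog server can be approximated with the stock mysqlbinlog utility, which can connect to the master as if it were a slave and stream raw binlog files to local disk; the sketch below wraps that invocation (host, credentials, output directory, and the starting binlog file are placeholders). Note that the ACK-before-commit behaviour described above is not provided by mysqlbinlog itself; in practice it comes from pointing semi-synchronous replication at the binlog server, so this sketch only covers the streaming side.

```python
import subprocess

def run_binlog_server(master_host, start_file, out_dir="/data/binlog_server"):
    """Continuously stream raw binlogs from the master to local disk."""
    cmd = [
        "mysqlbinlog",
        "--read-from-remote-server",
        f"--host={master_host}",
        "--user=repl", "--password=...",   # placeholder credentials
        "--raw",                           # write binlog files verbatim
        "--stop-never",                    # keep streaming new events as they arrive
        f"--result-file={out_dir}/",
        start_file,                        # e.g. "mysql-bin.000001"
    ]
    subprocess.run(cmd, check=True)
```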

Distributed Agent High Availability

For the MHA management node single-point problem, one approach is to deploy an agent on every node of the MySQL cluster. When a failure occurs, the agents hold an election and vote a suitable slave in as the new master, instead of relying on the manager alone to perform the switchover, which removes the MHA single point. The overall architecture is as shown.
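One simple way to run such an election is to lean on ZooKeeper's election recipe. This is a hedged sketch of the idea, not the agents' actual protocol; the ZooKeeper ensemble, the election path, and the promotion routine are placeholders.

```python
from kazoo.client import KazooClient

def promote(me):
    """Placeholder: promote the local MySQL instance and repoint the other slaves."""
    print(f"{me} won the election, promoting local instance to master")

def run_agent(my_host):
    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # placeholder ensemble
    zk.start()
    # All surviving agents race for leadership; the winner performs the failover.
    election = zk.Election("/mysql/cluster-demo/election", identifier=my_host)
    election.run(promote, my_host)
```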

MGR with middleware high availability

This approach solves the previous problems to some extent, but the agents and the binlog server are themselves newly introduced risks, and the binlog server also adds overhead to the response time. Is there a way to remove the binlog server and the agents while still guaranteeing that no data is lost? Of course there is.

In recent years the MySQL community has been very enthusiastic about the distributed consensus protocols Raft and Paxos, and the community has released MGR, a Paxos-based version of MySQL that pushes consistency and the switchover process down into the database itself, hiding the switchover details from the layers above. The architecture is as follows (taking MGR's single-primary mode as an example).

When a database failure occurs, MySQL switches over internally by itself. After the switchover completes, the new topology is pushed to Zebra Monitor, and Zebra Monitor makes the corresponding changes to read and write traffic. However, this architecture has the same acknowledgement problem as the binlog server: every time the master writes data, a majority of the nodes must reply with an ACK before the write succeeds, which adds some response-time overhead. At the same time, each MGR cluster requires an odd number of nodes (greater than 1), so a setup that used to need only one master and one slave on two machines now needs at least three, which wastes some resources. Even so, the arrival of MGR is undoubtedly another great innovation for the MySQL database.
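As a reference point, MGR single-primary mode is configured through the group_replication plugin and a handful of system variables. The sketch below shows the statements for one member issued from Python; the group name, addresses, seed list, and credentials are placeholders, and a real deployment also needs GTID and binlog settings in my.cnf.

```python
import pymysql

GROUP_SEEDS = "db1:33061,db2:33061,db3:33061"   # placeholder member list

def start_member(host, local_address, bootstrap=False):
    conn = pymysql.connect(host=host, user="dba", password="...")
    with conn.cursor() as cur:
        cur.execute("INSTALL PLUGIN group_replication SONAME 'group_replication.so'")
        cur.execute("SET GLOBAL group_replication_group_name = "
                    "'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee'")   # placeholder UUID
        cur.execute(f"SET GLOBAL group_replication_local_address = '{local_address}'")
        cur.execute(f"SET GLOBAL group_replication_group_seeds = '{GROUP_SEEDS}'")
        cur.execute("SET GLOBAL group_replication_single_primary_mode = ON")
        if bootstrap:
            # Only the first member bootstraps the group.
            cur.execute("SET GLOBAL group_replication_bootstrap_group = ON")
        cur.execute("START GROUP REPLICATION".replace(" REPLICATION", "_REPLICATION")
                    if False else "START GROUP_REPLICATION")
        if bootstrap:
            cur.execute("SET GLOBAL group_replication_bootstrap_group = OFF")
```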

Conclusion

This article has described the evolution of our MySQL high-availability architecture from MMM to MHA + Zebra and MHA + proxy, as well as some high-availability practices in the industry. Databases have been developing rapidly in recent years, and there is no perfect design for database high availability; only through continuous breakthroughs and innovation can we keep exploring better designs and more complete solutions along this road.

About the author

Jinlong joined Meituan-Dianping in 2014 and mainly works on database operations, high availability, and the construction of the related operations platforms. Interested readers can follow my personal public account, "own designer", where I regularly publish original operations-related content.

The DBA team is recruiting DBA talent of all kinds, based in Beijing and Shanghai. We are committed to providing the company with stable, reliable, and efficient online storage services and to building an industry-leading database team. We run Squirrel, a large-scale distributed cache system based on Redis Cluster; Cellar, a heavily modified distributed KV storage system based on Tair; and thousands of MySQL instances of various architectures serving trillions of OLTP requests per day: a genuinely massive, distributed, high-concurrency environment. Friends are welcome to refer candidates or send résumés to jinlong.cai#dianping.com.
