Zero business impact! How to skillfully use MySQL traditional replication in an online environment


In this article I won't describe how to deploy a MySQL replication environment or a keepalived dual-master environment; there are plenty of articles on installation and setup, and most people are already familiar with it. Instead, I will focus on how MySQL replication architectures are adjusted in an online business environment, so that the impact of those adjustments on the business is as small as possible, ideally zero. I hope you get something out of it.

1. Monitoring and management in MySQL replication

As we all know, MySQL replication (whether classic traditional replication or the GTID replication newly introduced in 5.6) is based on two core threads: io_thread and sql_thread. (I won't elaborate on replication internals here; if you're interested, watch for my future articles.)

Then, in practical work, we need to be able to answer the following questions:

    • Are the replication threads (io_thread and sql_thread) running normally?

    • Does the slave lag behind the master, and if so, by how much?

Obviously, SHOW SLAVE STATUS\G on the slave gives us enough information to answer these questions. (In a dual-master environment, both replication nodes have SHOW SLAVE STATUS\G output.)

The first things to look at in SHOW SLAVE STATUS\G are:

    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes

Slave_IO_Running indicates whether io_thread is working properly, and Slave_SQL_Running indicates whether sql_thread is working properly. If either is No, your replication environment should raise an alert.

But these two indicators alone are not enough. In the SHOW SLAVE STATUS\G output we also need to pay attention to the value of Seconds_Behind_Master (unit: seconds).

Seconds_Behind_Master = 0 means there is no delay, > 0 means there is a delay, and NULL usually means replication is broken.

In practice, I've found that many DBA friends judge whether the slave lags purely from the value of Seconds_Behind_Master. That is acceptable in some cases, but it is not accurate enough, especially in high-concurrency systems, so I don't recommend using Seconds_Behind_Master as the criterion for replication latency.

In other words, Seconds_Behind_Master does not reflect replication latency very accurately; we need to consider more factors.

To really determine whether the slave has a delay, we should start from whether the binlog file name and binlog position have caught up.

We know that the SHOW SLAVE STATUS\G output includes, among others, these four fields:
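(The file names and position values below are illustrative.)

    Master_Log_File: mysql-bin.000112          <- binlog file io_thread is reading
    Read_Master_Log_Pos: 107                   <- position io_thread has read up to
    Relay_Master_Log_File: mysql-bin.000112    <- binlog file sql_thread has executed up to
    Exec_Master_Log_Pos: 107                   <- position sql_thread has executed up to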

Most of the time you can confirm whether there is a delay by comparing the positions of io_thread and sql_thread. (io_thread is rarely delayed; delays generally occur in sql_thread. The premise of this comparison is that we can treat io_thread's read position as consistent with the master's SHOW MASTER STATUS position.)

When the SHOW SLAVE STATUS\G output satisfies Master_Log_File = Relay_Master_Log_File and Read_Master_Log_Pos = Exec_Master_Log_Pos, we can assume that replication has no delay and synchronization is effectively real-time.

If you are also worried about io_thread delay (for example, in a replication environment with a poor network), a more rigorous approach is the following:
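That is, compare the master's own coordinates with the slave's executed coordinates:

    On the master: SHOW MASTER STATUS    -> File, Position
    On the slave:  SHOW SLAVE STATUS\G   -> Relay_Master_Log_File, Exec_Master_Log_Pos

    No delay (io_thread included) when: File = Relay_Master_Log_File and Position = Exec_Master_Log_Pos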

The concrete implementation idea: from a third-party monitoring node, issue SHOW BINARY LOGS (or SHOW MASTER STATUS) against the master and SHOW SLAVE STATUS\G against the slave at the same time, then apply the formula above to determine whether there is a gap.

Of course, done this way the script becomes relatively cumbersome, and the network requirements on the third-party monitoring node are very high. At the same time, because io_thread is generally not the bottleneck, you don't have to go that far in a monitoring script. Most of the time, monitoring replication latency on the slave side as described earlier is enough:
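As a rough illustration, such a slave-side check might look like this shell sketch (host, credentials, and alerting are illustrative; adapt them to your environment):

    #!/bin/bash
    # Minimal sketch: check the replication threads, then compare the read
    # position (io_thread) against the executed position (sql_thread).
    STATUS=$(mysql -h 192.168.1.11 -uroot -pxxxx -e 'SHOW SLAVE STATUS\G')
    IO=$(echo "$STATUS"    | awk '/Slave_IO_Running:/       {print $2}')
    SQL=$(echo "$STATUS"   | awk '/Slave_SQL_Running:/      {print $2}')
    RFILE=$(echo "$STATUS" | awk '/ Master_Log_File:/       {print $2}')
    RPOS=$(echo "$STATUS"  | awk '/Read_Master_Log_Pos:/    {print $2}')
    EFILE=$(echo "$STATUS" | awk '/Relay_Master_Log_File:/  {print $2}')
    EPOS=$(echo "$STATUS"  | awk '/Exec_Master_Log_Pos:/    {print $2}')
    [ "$IO" = "Yes" ] && [ "$SQL" = "Yes" ]          || echo "ALERT: replication thread not running"
    [ "$RFILE" = "$EFILE" ] && [ "$RPOS" = "$EPOS" ] || echo "ALERT: sql_thread lags io_thread"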

There is also another way to monitor MySQL replication latency: maintain a heartbeat (monitoring) table.

You can maintain a monitoring table on the master, generally with only two fields (id and time), storing the latest timestamp. (On newer MySQL versions it can be updated by the event scheduler; on older versions it can be updated from a Linux cron job driving a looping script.) Then read the time field on the slave: as long as the master's and slave's system clocks are consistent (NTP synchronization is required), you can quickly tell how far the slave lags behind the master.

Example:
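A minimal sketch (the table, event, and interval names are illustrative; DATETIME(3) and NOW(3) require MySQL 5.6.4+, and the event scheduler must be enabled):

    -- on the master:
    SET GLOBAL event_scheduler = ON;
    CREATE TABLE heartbeat (id INT PRIMARY KEY, ts DATETIME(3));
    INSERT INTO heartbeat VALUES (1, NOW(3));
    CREATE EVENT heartbeat_update
        ON SCHEDULE EVERY 1 SECOND
        DO UPDATE heartbeat SET ts = NOW(3) WHERE id = 1;

    -- on the slave, the replication lag in milliseconds:
    SELECT TIMESTAMPDIFF(MICROSECOND, ts, NOW(3)) / 1000 AS lag_ms
      FROM heartbeat WHERE id = 1;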

This type of monitoring is commonly known as heartbeat monitoring.

Also note that in high-concurrency systems this timestamp needs to be refined to the millisecond; otherwise, even when the timestamps look equal, the slave may still be a number of binlog events behind.

These are some of the things to be aware of when monitoring MySQL traditional replication. As long as the idea is right, writing a MySQL replication monitoring script should follow naturally. (Writing one in Python, Perl, Shell, or any other language should not be a big problem.)

Now, let's cut to the chase.

2. Online architecture adjustments in MySQL replication

First of all, I want to emphasize the word online, because there is not much technical content in adjusting an architecture in an offline (service-down) environment.

An online architecture is a system architecture that has been put into production and is open to business users.

In an online architecture, data is being written every second, and the File and Position shown by SHOW MASTER STATUS are changing constantly. The data is not static: that is what makes the architecture online.

This is different from a development or testing environment inside the company; it is a live production environment.

What we are discussing here are replication architecture adjustments, and they must be carried out against this online backdrop.

In addition, the following content is based on MySQL traditional replication. If you use the GTID replication available from MySQL 5.6 onward, you don't need to pay much attention to what follows; GTID replication is far more convenient to operate than traditional replication, and I will cover it in a later article.

First case: how to introduce a VIP into an online replication environment

Suppose we have a simple online master-slave environment, as follows:

All application connections go to the master at 192.168.1.10.

We want to scale up MySQL's memory by performing an online switchover, because the slave's host has 32 GB of RAM.

With the current architecture, after the switch we would have to change the database address on the application servers to 192.168.1.11. That change would certainly affect the business system and cannot meet the requirements of an online architecture.

A better solution is to introduce a virtual IP (VIP) into the current replication environment and switch using the VIP; the impact of the online switchover then becomes much smaller.

Suppose the VIP we applied for is 192.168.1.20.

Then we first bind this VIP on the master:
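For example (a sketch; the NIC name eth0 and the netmask are assumptions about the environment):

    # on Node1 (the master):
    /sbin/ifconfig eth0:1 192.168.1.20 netmask 255.255.255.0 up
    # or, with iproute2:
    # ip addr add 192.168.1.20/24 dev eth0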

There will now be two IPs on the master's host (Node1):

    192.168.1.10 (real IP)
    192.168.1.20 (VIP, bound by hand)

What we have to consider now is changing the database address on the online application servers to 192.168.1.20:3306. How big is the impact of this change?

The actual impact depends on your front-end application server architecture. If you have only one application server, there will certainly be some impact; but most of the time the front-end application servers form a cluster and can be stateless. In fact, as long as the front-end architecture is even slightly reasonable, introducing the VIP can be a zero-impact operation.

After the VIP is introduced (in fact, it's just a VIP bound on Node1), the application servers connect to the database through the VIP. The architecture looks like this:

Now for the switchover itself. We won't introduce any third-party components here; Keepalived, MHA and the like are not considered. We'll simply write our own script to do the switch.

Let's look at the idea behind a manual scripted switchover in the online architecture we've been discussing:

1. Preliminary preparation

Master "Node1" VIP unbind "VIP unbind", while checking the original master whether there is also connection and connection processing, if there is a connection, you should consider the connection to kill:

Then just run source /tmp/1.sql.

Besides the residual connections on the master, the preparation should also cover the read_only setting on the slave: if the slave has read_only enabled, remember to turn it off.

Once the remaining connections on the master are killed, the data is temporarily quiescent.

2. Replication consistency check before taking over the VIP

This consistency check is simple; it is the old-fashioned comparison we covered earlier:
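That is, compare the old master's coordinates with what Node2 has executed:

    Node1: SHOW MASTER STATUS    -> File, Position
    Node2: SHOW SLAVE STATUS\G   -> Relay_Master_Log_File, Exec_Master_Log_Pos

    Safe to take over when: File = Relay_Master_Log_File and Position = Exec_Master_Log_Pos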

Once the replication positions are confirmed to have caught up, Node2 can take over the VIP. In the script, it is best to record the Log_File and Log_Pos and judge consistency from those values.

3. Bind the VIP on Node2

Bind the VIP on the slave:
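For example (a sketch; eth0 is an assumption, and the arping flags are those of iputils arping):

    # on Node2 (the new master):
    /sbin/ifconfig eth0:1 192.168.1.20 netmask 255.255.255.0 up
    # send gratuitous ARP so switches and clients learn the new MAC address
    arping -c 3 -A -I eth0 192.168.1.20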

This completes the entire switchover process.

To summarize, it's very simple:

    1. Introduce a VIP into the replication environment and change the application servers' database address to the VIP

    2. Unbind the VIP from the original master

    3. Kill all application connections on the original master

    4. Check that synchronization is consistent

    5. Bind the VIP on the slave

If the script is well prepared, executing these steps generally takes only 3-5 seconds.

Even for an online architecture, a 3-5 second blip on the database is hardly noticeable to applications during low-peak periods (we should still execute the switch at off-peak times), so this can definitely meet the requirements of an online system.

Another point of caution: make sure the synchronization check confirms replication has fully caught up. If it hasn't, wait a moment until it has.

After a successful switch, the VIP is bound on Node2 (the new master). Should Node1 then replicate from Node2 and become the slave? Two cases are worth discussing:

    • If Node1 and Node2 were in a dual-master architecture before, this is the simplest case: nothing more needs to be done. (In real-world architectures dual master is the most widely used, precisely because it makes maintenance after a switch convenient.)

    • If Node1 and Node2 were only master-slave, then use the Log_File and Log_Pos recorded in the earlier script step (the "check whether synchronization is complete" step), run CHANGE MASTER TO MASTER_LOG_FILE=xxx, MASTER_LOG_POS=xxx on Node1 (the new slave), and run RESET SLAVE ALL on Node2 (the new master) to clear its old SHOW SLAVE STATUS\G information, as sketched below.
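For the master-slave case, the statements look roughly like this (a sketch; host, user, and password are illustrative, and the xxx placeholders come from the recorded Log_File/Log_Pos):

    -- on Node1 (the new slave):
    CHANGE MASTER TO
        MASTER_HOST = '192.168.1.11',
        MASTER_USER = 'repl',
        MASTER_PASSWORD = 'xxx',
        MASTER_LOG_FILE = 'mysql-bin.000xxx',
        MASTER_LOG_POS  = xxx;
    START SLAVE;

    -- on Node2 (the new master):
    RESET SLAVE ALL;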

Second case: adjusting to a MySQL cascading replication architecture in an online system environment

Suppose the existing architecture is as follows:

The current architecture is:

    A->B (one-way replication)
    A->C (one-way replication)

How do we adjust the above into the A->B->C cascading replication architecture?

Again, we need to do it online without much impact on the business. (Assume the business runs on the master, with no read/write splitting.)

The idea here is:

    • Don't touch master A, because the business runs on A (this guarantees zero impact on the master and keeps the online business uninterrupted). Instead, work on slave1 and slave2: as long as slave1 and slave2 are stationary, we can adjust the architecture easily.

    • The goal to achieve: make slave1 (B) and slave2 (C) stop at the same position, so that the data is stationary.

Operation Steps:

1. On A (master), execute: CREATE TABLE Biaobiao (id int);

2. On B and C (slaves), execute: DROP TABLE Biaobiao;

3. On A (master), execute: DROP TABLE Biaobiao;

The operations above cause synchronization errors on B and C. After the errors occur (the SQL threads stop with an error), we can compare the sql_thread execution positions. (The real purpose of all this is to make B and C stop at the same position.)

4. Run SHOW SLAVE STATUS\G on slave1 (B) and slave2 (C): you will see the error, and at the same time you should check that the replication positions on the two are equal (the sql_thread on each slave should have stopped working by now):

The formula for checking the replication positions is as follows:
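That is, on B and C the following SHOW SLAVE STATUS\G values should be pairwise equal:

    B.Relay_Master_Log_File = C.Relay_Master_Log_File
    B.Exec_Master_Log_Pos   = C.Exec_Master_Log_Pos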

When the check shows equality, B and C have stopped at the same replication position, and the data on B and C should now be exactly the same.

Neither Relay_Master_Log_File nor Exec_Master_Log_Pos on B or C will change at this point.

Once B and C are at the same position, the rest follows naturally: repoint C to replicate from B.
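A sketch of the repointing (host, user, and password are illustrative):

    -- on B: record its own, now static, binlog coordinates
    SHOW MASTER STATUS;    -- note File and Position

    -- on C: repoint replication at B using those coordinates
    STOP SLAVE;
    CHANGE MASTER TO
        MASTER_HOST = '<B ip>',
        MASTER_USER = 'repl',
        MASTER_PASSWORD = 'xxx',
        MASTER_LOG_FILE = '<File from B>',
        MASTER_LOG_POS  = <Position from B>;
    START SLAVE;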

At this point, C becomes the slave of B.

The architecture adjustment to A->B->C is now in place.

If you script this, make sure to run SHOW MASTER STATUS on B a few times and confirm that the output does not change (B is quiescent) while the positions of B and C stay equal:

    B node: SHOW MASTER STATUS;
    sleep(10);
    B node: SHOW MASTER STATUS;

Because the SQL thread is stuck on the error (Slave_SQL_Running: No, Slave_IO_Running: Yes), the SHOW MASTER STATUS output on B must be identical before and after.

B is stationary, C is stationary, and the Exec_Master_Log_Pos values of B and C are equal.

It's not over yet: we still have to restore A->B replication. The SQL thread of the A-to-B channel is stopped on the error and has to be recovered:

Skip the replication error on B by executing, on B:
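This is the classic single-event skip (safe here only because we created the error ourselves):

    STOP SLAVE;
    SET GLOBAL sql_slave_skip_counter = 1;
    START SLAVE;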

We can skip it because this replication error was deliberately simulated by us and is fully understood. Never lightly skip a replication error you are not sure about.

Finally, after everything is recovered, run a test to confirm that changes flow through the new chain from top to bottom. If conditions allow, validate the data once more.

At this point, we have completed the online replication architecture adjustment we needed.

Third case: operations management in a dual-master replication environment

In fact, any company whose business is slightly larger will build a dual-master (master<->master) architecture for MySQL. Compared with one-way master-slave replication, a dual-master architecture is more convenient and easier to manage in operations: with one-way master-slave, after a master-slave switch you still have to run another CHANGE MASTER TO on the old master and a RESET SLAVE ALL on the new master.

That is slightly more cumbersome than a dual-master replication architecture. After a dual-master switch, none of those extra steps is needed; just float the VIP over (dual master plus an HA component such as Keepalived makes automatic failover very convenient).

If it's just a two-node dual-master environment, note one thing: in production, it is best to let only one of the nodes accept business writes.

Then why, when dual master is clearly bidirectional replication and the two nodes could in principle both accept writes, do I recommend writing on only one side? Isn't single-sided writing a bit of a waste of resources?

Some friends will answer that MySQL has two handy parameters:
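Namely (shown with example values for a two-node dual master):

    auto_increment_increment = 2    # step between consecutive auto-increment values
    auto_increment_offset    = 1    # set to 1 on one node and 2 on the other

With these settings, one node generates odd auto-increment ids and the other even ones, so inserts never collide on auto-increment keys.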

As long as these two parameters are set correctly on the two dual-master nodes, the thinking goes, we can solve the double-write conflict problem, maximize the utilization of server resources, and relieve the write pressure on a single node.

Indeed, auto_increment_increment and auto_increment_offset can solve some write conflicts, but only insert conflicts. When update and delete operations target the same row on both nodes, these two parameters cannot help.

Imagine the following SQL statement executing at the same time on both dual-master nodes (assume high concurrency at business peak):

    UPDATE tb SET col1 = xxx WHERE id = 1;

Obviously it will cause write conflicts and lock contention; without doubt, we have dug ourselves a hole.

So again: auto_increment_increment and auto_increment_offset can only resolve insert write conflicts; update or delete conflicts on the same row cannot be resolved.

Perhaps you will also say: our company's business is split well; on Node1 we only write table A, and on Node2 only table B, so updating the same row from both nodes cannot happen. That does sound like it finally solves the write-conflict problem, but you may still run into another issue:

If writes are enabled on both nodes and one node unfortunately fails, can the other node successfully take over? You will face a performance bottleneck: most of the time the resources of two nodes carried the read and write pressure, and suddenly a single node has to bear all of it. Can that node's performance withstand it?

This is a bit like the design considerations of an Oracle RAC architecture: with two nodes providing service simultaneously, you also have to think hard about performance and keep a performance reserve on each node; otherwise, when one node dies, the other may fail over successfully and still be unable to carry the load.

Therefore, in a dual-master replication environment, I recommend serving the business from a single node and keeping the other node purely as a failover standby. There is no need for both nodes to carry business pressure at the same time; doing so may dig pits for us in operations management.

That is the operations note for a two-node dual-master architecture. In many cases, though, we also introduce one or more slaves into the dual-master environment to share the read pressure or to run statistics and reports, as follows:

The architecture above is the most common one in our actual operations management.

In this architecture:

    • M1 and M2 form a dual-master (mutual) replication pair, and S1 is a separate slave.

    • R_IP is each server's real IP, and W_VIP is the virtual IP for writes. All application write operations should connect to the database through the VIP. In the diagram, W_VIP hangs on M1, meaning M1 is the node serving the application business; M2 exists purely as a failover node; and S1 is the node we use for statistics and reporting (sharing part of the read pressure).

The first question we need to settle in such an architecture is:

Should S1 hang under M1, under M2, or under the W_VIP?

    • Obviously, under the traditional replication model S1 cannot hang under the W_VIP, because M1's and M2's binlog file names and positions are completely different (SHOW MASTER STATUS differs between M1 and M2). With GTID replication it could in principle hang under the W_VIP, but even with GTID this is generally not done.

In addition, if S1 hangs under M1, the business-serving node, then when M1 unfortunately dies it is difficult, under traditional replication, to connect S1 up to M2.

Therefore, the correct answer is: S1 must hang under the dual-master node that is not serving the business, i.e. the M2 node in the diagram. The advantage is that when M1 unfortunately dies, replication between M2 and S1 is unaffected.

This is the point to pay attention to when designing the architecture.

Another important point: when M1 dies, W_VIP drifts to the M2 node, ensuring that the application keeps connecting to the database normally, like this:

After M1 is repaired, it rejoins the replication cluster, like this:

At this point the architecture becomes:

    M2->S1
    M2<->M1

But with this architecture, the design principle we stated just now, "S1 must hang under the non-business-serving dual-master node", is violated: S1 now hangs under M2, which is serving the business. That is not right, so we also need to adjust S1's position in the replication topology and move it under M1.

We need to work out how to adjust the architecture from the left diagram to the right one, and this, too, must of course be done with the business online.

The adjustment is very similar to what we discussed in the second case.

The current architecture is:

    M2<->M1 (bidirectional replication)
    M2->S1 (one-way replication)

Now it needs to become: M2<->M1->S1 (cascading replication)

To make this adjustment, we can follow exactly the approach described in the second case (please go back and reread that section carefully).

The basic idea is still to keep zero impact on M2 while parking M1 and S1 at the same replication position by means of deliberately simulated replication errors.

The one small difference here is that M2 and M1 are masters of each other, so the simulated statements executed locally on M1 would also replicate to M2. When running the simulated operations on M1, remember to temporarily disable binary logging for that session, so the simulated statements do not propagate to M2:
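A sketch, reusing the simulated statements from the second case (session-level, so only this session's statements skip the binlog; the table name is the earlier example's):

    -- on M1, in the session that runs the simulated drop:
    SET SESSION sql_log_bin = 0;
    DROP TABLE Biaobiao;
    SET SESSION sql_log_bin = 1;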

The remaining steps are exactly the same as in the second case; just follow them. I won't repeat them here.

With the above adjustment done, the switchover for the dual-master-plus-slave replication architecture is complete.

Fourth case: adjusting the cascading replication architecture back to one-master-multiple-slaves in an online system environment

The original architecture is: A->B->C (cascading replication)

Now we need to turn the architecture into:

    A->B
    A->C (one master, two slaves)

As before, node A is the node serving the application business.

In fact, after the previous cases, this one is very simple; I'll just sketch it briefly:

For this requirement, run STOP SLAVE on the B node. Once B has stopped and C has finished applying everything it received from B, the data on B and C should be consistent:

Check with the following formula:
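Since C replicates from B, compare B's own coordinates with what C has executed:

    On B: SHOW MASTER STATUS    -> File, Position
    On C: SHOW SLAVE STATUS\G   -> Relay_Master_Log_File, Exec_Master_Log_Pos

    Consistent when: File = Relay_Master_Log_File and Position = Exec_Master_Log_Pos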

When the formula is satisfied, the B node and the C node are consistent.

Next comes the CHANGE MASTER on C: on the C node, run CHANGE MASTER TO toward A:
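A sketch (host, user, and password are illustrative; where the coordinates come from is explained in the note below):

    -- on C:
    STOP SLAVE;
    CHANGE MASTER TO
        MASTER_HOST = '<A ip>',
        MASTER_USER = 'repl',
        MASTER_PASSWORD = 'xxx',
        MASTER_LOG_FILE = '<Relay_Master_Log_File from B>',
        MASTER_LOG_POS  = <Exec_Master_Log_Pos from B>;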

Note that the coordinates used here are the Relay_Master_Log_File and Exec_Master_Log_Pos from SHOW SLAVE STATUS\G on the B node (B's executed position in A's binlog).

    START SLAVE;

Finally, go back to the B node and run:

    START SLAVE;

The entire architecture adjustment is complete.

Summary

The above is a summary of some of my work. The focus is on giving you a way of thinking, and I hope you can keep expanding on these ideas.

With a clear idea, it is not difficult to write management scripts for MySQL replication operations.

If you run into any problems writing replication management scripts, feel free to discuss them with me.

In addition, every company's business may place different requirements on an online architecture: some can tolerate a 3-5 second (or even longer) pause in writes at specific times, while others have zero tolerance. We should stay closely aligned with business needs, and let those needs guide how we do MySQL replication management.

Of course, as MySQL continues to evolve, features such as the GTID replication introduced in MySQL 5.6 and the group commit and parallel replication introduced in MySQL 5.7 have gradually made MySQL replication operations more convenient and efficient.

However, no matter how the technology develops and changes, mastering the operations techniques of MySQL's classic traditional replication will still help us manage many online environments.

Reposted from: http://www.itnose.net/detail/6525929.html
