A curious failure caused by a slow SQL (repost)

Source: Internet
Author: User
Tags: CPU usage

A few days ago, the primary instance of a customer database raised an alarm. Diagnosis showed that the failure was caused by a slow SQL statement, and as the investigation went deeper, the phenomenon turned out to be quite remarkable.

Problem description

At about 09:26 on December 9, 2016, the Master instance of a customer's production database raised an alarm with the following message:

"MySQL instance has not been updated for more than five minutes." Simply put, this alarm means that no information about the instance could be collected for five minutes.

At the same time, the developers reported that database request timeouts had been occurring since about 01:00 on December 9, and that the business did not return to normal until after the alarm was raised.

Troubleshooting

1. Monitoring process check

The database is monitored and alerted by Kangaroo Cloud EasyDB. Since the EasyDB server could not get information from the agent, the first task was to check whether the EasyDB agent was still alive.

As the output showed, the EasyDB agent process was working fine, so the next step was to troubleshoot at the database and server level.

2. MySQL error log

First, check whether the MySQL error log contains any error records. Log on to the host and start by looking at how MySQL is installed:

# ps -ef | grep mysqld | egrep -v mysqld_safe

...

/home/my3304/bin/mysqld --defaults-file=/home/my3304/my.cnf --basedir=/home/my3304 --datadir=/home/my3304/data --plugin-dir=/home/my3304/lib/plugin --log-error=/home/my3304/log/alert.log --open-files-limit=65535 --pid-file=/home/my3304/run/mysqld.pid --socket=/home/my3304/run/mysql.sock --port=3304

...

From the process status we can see the MySQL instance running on this host and some of its configuration options, including the path where the error log is stored.
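For reference, when the instance is still reachable the same information can also be obtained from inside MySQL; a minimal sketch using standard system variables:

-- Ask the running instance where its logs and data live
SHOW GLOBAL VARIABLES LIKE 'log_error';
SHOW GLOBAL VARIABLES LIKE 'slow_query_log_file';
SHOW GLOBAL VARIABLES LIKE 'datadir';

In this incident the instance was not accepting new connections at the time, which is why the information had to be read from the process list on the host instead.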

Checking the MySQL error log on that server, we found a large number of errors of the same kind, starting at about 09:16 on December 9 and continuing until about 09:48.

A quick check of what the error code means:

# ./perror 11

OS error code 11: Resource temporarily unavailable

The error roughly means that some resource is temporarily unavailable. Which resource? Threads. From this we can tell that the number of database connections was far too high and included a large number of active connection requests.

In other words, something caused the number of database connections to spike, with the result that the application could no longer connect.
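For reference, connection and thread usage can be checked against the configured limits with standard status counters; a minimal sketch (these commands only help after the fact or on a healthy instance, since at the time of the incident no new connection could be made):

-- How close is the instance to its connection/thread limits?
SHOW GLOBAL VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW GLOBAL STATUS LIKE 'Threads_running';
SHOW GLOBAL STATUS LIKE 'Connection_errors_max_connections';  -- available from MySQL 5.6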

3. Performance metrics

Next, look at the specific performance metrics of the database and the host: did the number of connections really spike, and were any other metrics abnormal?

With EasyDB installed, the performance graphs make it easy to see which metrics had problems.

The performance graphs of the alarming instance show that the number of connections had been rising since about 01:00 on December 9, stopped growing at around 09:16, and was almost completely released by around 09:50. This matches the time window of the errors we just saw in the error log.

It is also basically consistent with the host's TCP connection trend graph.

The next task, then, was to find what caused the connection spike, or rather why most connections were not being released. Looking at the graphs of the other performance metrics, the remaining resource consumption was surprisingly low.

Database TPS, QPS, and CPU usage:

CPU usage and IO status of the host:

Throughout this period QPS and TPS were extremely low, and CPU and I/O resources were almost completely idle, so most threads must have been in a blocked state.

Surprisingly, several other key performance indicators hit the opposite extreme at around 09:50, exactly when the connection count dropped.

The other graphs showing anomalies include the host's I/O throughput and network throughput:

Let us first analyze the situation before 09:50. In short, the performance data for this period shows a rising connection count with very low resource utilization.

In general, two things cause this pattern: lock waits, and slow queries. What they have in common is that some SQL statement or transaction causes cascading blocking, until a global or very wide range of sessions ends up blocked.

Next, check the lock wait situation.
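A minimal sketch of how InnoDB lock waits can be checked, assuming MySQL 5.6/5.7 where these information_schema tables exist:

-- Who is waiting for which lock, and who holds it?
SELECT r.trx_id              AS waiting_trx,
       r.trx_mysql_thread_id AS waiting_thread,
       r.trx_query           AS waiting_query,
       b.trx_id              AS blocking_trx,
       b.trx_mysql_thread_id AS blocking_thread,
       b.trx_query           AS blocking_query
FROM   information_schema.INNODB_LOCK_WAITS w
JOIN   information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id
JOIN   information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id;

Note that blocking caused by FLUSH TABLES shows up as a "Waiting for table flush" thread state in SHOW PROCESSLIST rather than as an InnoDB row-lock wait, so an empty result here does not rule out blocking.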

So, by this point in the analysis, the problem can basically be attributed to a slow SQL statement.

4. MySQL slow log

Next, the goal seems clearer: find the slow SQL that appears to be causing the blocking.

The slow log report for 09:00 to 10:00 shows two SQL statements whose execution took an extremely long time.
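For reference, a minimal sketch of how such statements can be located when the slow log is written to a table (log_output=TABLE is an assumption; many installations, quite possibly including this one, log to a file and use a report tool instead):

-- Slow statements recorded between 09:00 and 10:00 on 2016-12-09, longest first
SELECT start_time, query_time, rows_examined, LEFT(sql_text, 120) AS sql_text
FROM   mysql.slow_log
WHERE  start_time BETWEEN '2016-12-09 09:00:00' AND '2016-12-09 10:00:00'
ORDER  BY query_time DESC
LIMIT  10;

Note that the slow log records a statement only when it finishes, which is why a query started at 21:11 the previous night shows up in the 09:00 to 10:00 report.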

Put plainly: a SQL statement was started at about 21:11 on the night of December 8 and ran for roughly 12 hours! The FLUSH TABLES operation executed at 01:02 on December 9 had to wait for that SQL to finish, and once FLUSH TABLES is blocked, every subsequent SQL operation is blocked behind it.
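This blocking chain is easy to reproduce on a test instance; a minimal sketch with three sessions (the database and table names here are made up for illustration):

-- Session 1: a long-running query keeps the table "in use"
SELECT COUNT(*) FROM test.big_table AS a JOIN test.big_table AS b;

-- Session 2: the backup's flush step must wait for session 1 to finish
FLUSH TABLES;                           -- state: Waiting for table flush

-- Session 3: any later statement touching the table queues behind the flush
SELECT * FROM test.big_table LIMIT 1;   -- also stuck in Waiting for table flush

-- Session 4: observe the pile-up
SHOW PROCESSLIST;

As long as session 1 keeps running, sessions 2 and 3 (and everything after them) stay blocked, which is exactly the pattern seen in this incident.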

This observation alone ties together several of the clues found earlier:

1. The application started reporting errors at about 01:00: once FLUSH TABLES was blocked, new connections could still be created, but their SQL could not execute. This is why the blocked threads all show as active.

2. The alarm came shortly after 09:00 because the database could no longer create new connections, so the agent process could not connect to MySQL; once this had lasted 5 minutes the alarm was triggered.

3. Between about 09:16 and 09:50, the errors in the MySQL error log match the connection-count curve: thread resources were exhausted at 09:16, and at around 09:50 the blocking statement finished executing and the connections were released.

4. The abnormal relationship between the two groups of curves can also be explained: the slow SQL started executing at 21:11 on December 8 and ran for 45,453 seconds, finishing at just about 09:50 the next morning. When it finally started returning data, the logical-read and full-table-scan indicators spiked, which also suggests this SQL involves a sort operation.

So what exactly happened at these two points in time?

A quick discussion with the developers and a check of the EasyDB logs showed that the statement at 21:11 the previous night was a report query run by a developer, while the FLUSH operation at 01:00 was one phase of the automatic full-backup task.

So it seems the whole thing was caused by a developer running a bad SQL statement?

As the troubleshooting continued, things turned out to be not quite that simple.

Analyze the SQL with EXPLAIN:

The execution plan is as follows:

From the Extra column we can see that this SQL uses a temporary table for sorting, that the join column of the t_business_contacts table appears to have no index, and that a nested-loop join is used.
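The query text itself is not reproduced in this post, but the missing indexes flagged by the plan can be confirmed directly; a short sketch:

-- Confirm what indexes exist on the tables named in the execution plan
SHOW INDEX FROM t_business_contacts;
SHOW INDEX FROM t_system_organization;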

So how does it actually perform?

The developers said that when they ran it the previous night it returned results in less than 2 seconds, and a test execution confirmed this.

Since the system went live the overall load has been low and the amount of data queried is small, so even with a bad execution plan the execution time should not be dramatic. Yet from what we saw earlier, when the full-backup task started on the primary at 01:00, this SQL was still executing.

Reviewing the system load around 21:16: CPU and I/O resources were basically idle, QPS was very low, and there is no other slow-SQL record for December 8.

If the query returns data in a little over one second, where did this slow SQL come from? Perhaps some other error caused the same SQL request to be issued again, and that request got stuck at some stage of execution? From the earlier analysis, this SQL did not start returning data until about 09:50 the next morning.

Looking closely at the metrics, several system indicators did peak around 00:00 on December 9, but that still cannot explain why the SQL executed at 21:16 got stuck.

Setting that doubt aside for now: this SQL clearly has a serious performance problem, so let us optimize it first.

Add two indexes with the following statements:

ALTER TABLE t_business_contacts ADD INDEX idx_oi (org_id), ALGORITHM=INPLACE, LOCK=NONE;

ALTER TABLE t_system_organization ADD INDEX idx_on (org_name), ALGORITHM=INPLACE, LOCK=NONE;

The new execution plan shows some improvement: the nested-loop join over the unindexed column is gone. However, because the IN list is so large the optimizer does not choose an index for it, and it still uses the t_sup_qualifie_info table, with only about 4,000 rows in total, as the driving table.
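To understand why the optimizer skips an index when the IN list is very large, the optimizer trace can be inspected; a minimal sketch, assuming MySQL 5.6 or later (the report query itself is not reproduced in the post, so a placeholder marks where it would go):

SET optimizer_trace = 'enabled=on';
-- ... run the report query here (its text is not shown in the original post) ...
SELECT trace FROM information_schema.OPTIMIZER_TRACE;
SET optimizer_trace = 'enabled=off';

In 5.6 and later, the eq_range_index_dive_limit variable also influences how very long IN lists are costed, which may be relevant here.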

Next, look at the actual execution:

Execution efficiency improved by nearly 40 times! For now, the optimization looks promising.

5. Message log

One point was mentioned earlier: the time the business recovered basically matches the alarm time. Given what we have found, the question now is why the business returned to normal before the instance's connections were released.

Combined with the OS message log, this confusion was quickly cleared up. The log entries starting at 09:16 are as follows:

From this we can see that keepalived performed a primary/standby switchover when the target instance became unavailable, so subsequent business connections went to the original standby instance. The current replication topology as shown in EasyDB is as follows:
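For reference, the replication topology behind keepalived can also be checked from the database side; a minimal sketch (SHOW SLAVE HOSTS is only informative if the replicas set report_host):

-- On the instance currently acting as primary: which replicas are attached?
SHOW SLAVE HOSTS;
-- On a replica: where does it replicate from, and is it caught up?
SHOW SLAVE STATUS;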

6. Discussing the SQL execution scenario

At this point the failure process, the reason for the alarm, and the business recovery all have a reasonable explanation. The remaining question is: why was this SQL so slow?

We talked to the developer to understand how the SQL was actually run at the time.

The developer said the query was run from a client tool, the results were manually aggregated into Excel afterwards, and the query itself took less than two seconds. It is a routine daily report query and has been run this way for three months, so why did it cause a problem only today?

Given the business pressure and system load during the SQL's execution window, it is almost impossible to construct a scenario in which this SQL would take that long. After all, the execution took not a few hundred or a few thousand seconds but about 12 hours, a huge gap from the 1.13 s measured in the earlier manual validation.

The slow log definitely records such a slow execution, but that is completely at odds with the developer's description. Something else may have gone wrong during that period; without historical SHOW PROCESSLIST data it is hard to find the cause. We will see whether a similar problem can be reproduced under certain conditions later.
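Since the remaining puzzle comes down to the lack of historical SHOW PROCESSLIST data, one low-cost option is to snapshot the process list periodically so the next occurrence can be diagnosed; a minimal sketch using the event scheduler (the "dba" schema and object names are made up, and the event scheduler must be enabled):

-- One-time setup: an empty history table with the same columns plus a timestamp
CREATE TABLE IF NOT EXISTS dba.processlist_history AS
    SELECT NOW() AS snapshot_time, p.*
    FROM   information_schema.PROCESSLIST p
    WHERE  1 = 0;

SET GLOBAL event_scheduler = ON;

-- Snapshot all non-idle sessions once a minute
CREATE EVENT dba.ev_snapshot_processlist
    ON SCHEDULE EVERY 1 MINUTE
    DO
      INSERT INTO dba.processlist_history
      SELECT NOW(), p.*
      FROM   information_schema.PROCESSLIST p
      WHERE  p.COMMAND <> 'Sleep';

With such a history table, a 12-hour statement would leave hundreds of snapshots showing which execution state it was stuck in.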

Root cause analysis

This is a typical case of cascading blocking. The sequence of events was as follows:

1. At some point the previous evening, a client issued a SQL request and did not get a response for a very long time.

2. In the early morning, the automatic backup reached the FLUSH TABLES stage and had to wait for that SQL to finish.

3. Subsequent SQL requests were blocked behind the FLUSH TABLES.

4. The number of active connections kept growing until thread resources were exhausted and the application started reporting errors.

5. The EasyDB agent could not obtain instance information for 5 minutes and triggered the alarm; keepalived detected that the primary was unavailable and automatically switched over to the standby.

Causes

1. The SQL's execution efficiency was far too low (root cause unknown).

2. The automatic backup task was running on the primary.

Follow-up

After the SQL was optimized, the automatic backup tasks on all primary instances were turned off that same night, so that a similar situation cannot make a primary unavailable again.

As for what actually happened during that SQL execution, it is well worth digging into. Unfortunately MySQL does not provide a historical performance-data view, and the performance metrics analyzed by the tool are completely inconsistent with a 12-hour execution. Perhaps if a similar scenario can be constructed, the answer will turn up.

