MySQL performance bottleneck troubleshooting instance details, mysql bottleneck instance details
This article describes how to locate MySQL performance bottlenecks. We will share this with you for your reference. The details are as follows:
Guide
Starting from a scene, we will analyze how to locate performance bottlenecks throughout the entire process.
Troubleshooting Process
After receiving an alert message about the high load of a MySQL instance at the backend of an online business, log on to the server to check and confirm.
1. check and confirm the operating system.
After logging on to the server, we first need to confirm which processes cause high loads, where these processes are stuck, and what are the bottlenecks.
Generally, the disk I/O subsystem is the most likely bottleneck on the server, because its read/write speed is usually the slowest. Even the current PCIe SSD, its random I/O read/write speed is not as fast as the memory swap. Of course, there are also many causes of slow disk I/O, which need to be determined.
The first step is to take a look at the overall load. If the load is high, it is certain that all processes are running slowly.
You can run the command w or sar-q 1 to view the load data, for example:
[yejr@imysql.com:~ ]# w 11:52:58 up 702 days, 56 min, 1 user, load average: 7.20, 6.70, 6.47USER TTY FROM LOGIN@ IDLE JCPU PCPU WHATroot pts/0 1.xx.xx.xx 11:51 0.00s 0.03s 0.00s w
Or sar-q observation results:
[yejr@imysql.com:~ ]# sar -q 1Linux 2.6.32-431.el6.x86_64 (yejr.imysql.com) 01/13/2016 _x86_64_ (24 CPU)02:51:18 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 blocked02:51:19 PM 4 2305 6.41 6.98 7.12 302:51:20 PM 2 2301 6.41 6.98 7.12 402:51:21 PM 0 2300 6.41 6.98 7.12 502:51:22 PM 6 2301 6.41 6.98 7.12 802:51:23 PM 2 2290 6.41 6.98 7.12 8
Load average indicates how many tasks in the current CPU are waiting in queue. The more waiting, the higher the load. Generally, if the load value exceeds 5 on the server that runs the database, it is relatively high.
There may also be many causes of high load:
Some processes/services consume more CPU resources (services respond to more requests or some application bottlenecks );
Serious swap (insufficient physical memory available );
Serious interruptions (due to SSD or network interruptions );
Disk I/O is slow (the CPU waits for disk I/O requests );
In this case, we can execute the following command to determine which subsystem is the bottleneck:
[yejr@imysql.com:~ ]# toptop - 11:53:04 up 702 days, 56 min, 1 user, load average: 7.18, 6.70, 6.47Tasks: 576 total, 1 running, 575 sleeping, 0 stopped, 0 zombieCpu(s): 7.7%us, 3.4%sy, 0.0%ni, 77.6%id, 11.0%wa, 0.0%hi, 0.3%si, 0.0%stMem: 49374024k total, 32018844k used, 17355180k free, 115416k buffersSwap: 16777208k total, 117612k used, 16659596k free, 5689020k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND14165 mysql 20 0 8822m 3.1g 4672 S 162.3 6.6 89839:59 mysqld40610 mysql 20 0 25.6g 14g 8336 S 121.7 31.5 282809:08 mysqld49023 mysql 20 0 16.9g 5.1g 4772 S 4.6 10.8 34940:09 mysqld
Obviously, the previous two mysqld processes cause high overall load.
In addition, we can also see from the statistical results of the Cpu (s) line that the values of % us and % wa are high, it indicates that the current major bottleneck may be the CPU and disk I/O wait consumed by the user process.
First, we analyze the disk I/O.
Run sar-d to check whether the disk I/O is really large:
[yejr@imysql.com:~ ]# sar -d 1Linux 2.6.32-431.el6.x86_64 (yejr.imysql.com) 01/13/2016 _x86_64_ (24 CPU)11:54:32 AM dev8-0 5338.00 162784.00 1394.00 30.76 5.24 0.98 0.19 100.0011:54:33 AM dev8-0 5134.00 148032.00 32365.00 35.14 6.93 1.34 0.19 100.1011:54:34 AM dev8-0 5233.00 161376.00 996.00 31.03 9.77 1.88 0.19 100.0011:54:35 AM dev8-0 4566.00 139232.00 1166.00 30.75 5.37 1.18 0.22 100.0011:54:36 AM dev8-0 4665.00 145920.00 630.00 31.41 5.94 1.27 0.21 100.0011:54:37 AM dev8-0 4994.00 156544.00 546.00 31.46 7.07 1.42 0.20 100.00
Use iotop to determine which processes consume the most disk I/O resources:
[yejr@imysql.com:~ ]# iotopTotal DISK READ: 60.38 M/s | Total DISK WRITE: 640.34 K/s TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND16397 be/4 mysql 8.92 M/s 0.00 B/s 0.00 % 94.77 % mysqld --basedir=/usr/local/m~og_3320/mysql.sock --port=3320 7295 be/4 mysql 10.98 M/s 0.00 B/s 0.00 % 93.59 % mysqld --basedir=/usr/local/m~og_3320/mysql.sock --port=332014295 be/4 mysql 10.50 M/s 0.00 B/s 0.00 % 93.57 % mysqld --basedir=/usr/local/m~og_3320/mysql.sock --port=332014288 be/4 mysql 14.30 M/s 0.00 B/s 0.00 % 91.86 % mysqld --basedir=/usr/local/m~og_3320/mysql.sock --port=332014292 be/4 mysql 14.37 M/s 0.00 B/s 0.00 % 91.23 % mysqld --basedir=/usr/local/m~og_3320/mysql.sock --port=3320
As you can see, an instance with a port number of 3320 consumes a lot of disk I/O resources, so let's see what queries are running in this instance.
2. MySQL-Level Check and confirmation
First, check which queries are currently running:
[yejr@imysql.com(db)]> mysqladmin pr|grep -v Sleep+----+----+----------+----+-------+-----+--------------+-----------------------------------------------------------------------------------------------+| Id |User| Host | db |Command|Time | State | Info |+----+----+----------+----+-------+-----+--------------+-----------------------------------------------------------------------------------------------+| 25 | x | 10.x:8519 | db | Query | 68 | Sending data | select max(Fvideoid) from (select Fvideoid from t where Fvideoid>404612 order by Fvideoid) t1 || 26 | x | 10.x:8520 | db | Query | 65 | Sending data | select max(Fvideoid) from (select Fvideoid from t where Fvideoid>484915 order by Fvideoid) t1 || 28 | x | 10.x:8522 | db | Query | 130 | Sending data | select max(Fvideoid) from (select Fvideoid from t where Fvideoid>404641 order by Fvideoid) t1 || 27 | x | 10.x:8521 | db | Query | 167 | Sending data | select max(Fvideoid) from (select Fvideoid from t where Fvideoid>324157 order by Fvideoid) t1 || 36 | x | 10.x:8727 | db | Query | 174 | Sending data | select max(Fvideoid) from (select Fvideoid from t where Fvideoid>324346 order by Fvideoid) t1 |+----+----+----------+----+-------+-----+--------------+-----------------------------------------------------------------------------------------------+
We can see that many slow queries have not yet been completed, and we can also find from the slow query log that this type of SQL occurs frequently.
This is a very inefficient SQL statement, which requires scanning the entire primary key. However, you only need to obtain the maximum value, as shown in the slow query log:
Rows_sent: 1 Rows_examined: 5502460
Scan more than 5 million rows of data each time, but read only one maximum value, very low efficiency.
After analysis, this SQL statement can be completed within a single digit in milliseconds after a slight transformation. Originally, it took-seconds to complete the SQL statement, improving the power of Npower.
The transformation method is to sort the query results in reverse order and obtain the first record. The original practice was to sort the results in a forward order, and take the last record and sweat...
Conclusion:
In this example, the cause of the bottleneck is better located, and SQL optimization is not difficult. In the actual online environment, the load is usually high due to the following common causes:
The amount of data read/write per request is too large, resulting in a large disk I/O read/write value. For example, a single SQL statement needs to read or update tens of thousands of rows of data or even more, this is the best way to reduce the amount of data read/write at a time;
The SQL query does not have an appropriate index for filtering conditions, sorting (ORDER BY), grouping (GROUP BY), and data aggregation (MIN/MAX/COUNT/AVG ), add an index or rewrite an SQL statement;
There are a large number of requests in an instant. Generally, as long as the peak value can be reached, the server configuration should be appropriately improved for the sake of security. In case the peak value cannot withstand the avalanche effect, it may occur;
Because of the increase in load caused by some scheduled tasks, such as statistical analysis and backup of data, this consumes a lot of CPU, memory, and disk I/O resources, it is best to run the command on an independent slave server;
The server's own energy-saving strategy will reduce the CPU frequency when the load is low. When the load is found to be higher, the system will automatically increase the frequency, but it is usually not so timely. As a result, the CPU performance is insufficient, cannot withstand sudden requests;
When a RAID card is used, BBU is usually provided (the standby battery of the cache Module). In the early stage, lithium battery technology was generally used, and regular charge and discharge are required (the DELL server is used for 90 days, and IBM is used for 30 days ), we can monitor the next time before the charging and discharging time, we can discharge it in advance when the business is low, but most of the new generation servers use capacitive batteries, so this problem does not exist.
The file system uses ext4 or even ext3 instead of xfs. In high I/O pressure, % util may have reached 100%, but iops cannot be improved, changing to xfs can be greatly improved;
The kernel's io schedop policy uses cfq instead of deadline or noop, which can be directly adjusted online or greatly improved.