CPU at 95%: finding the problem SQL

Source: Internet
Author: User
Tags: sessions, SQL, CPU usage

We received a warning email:

Load average on Xxx_server reached critical threshold value: 169.5

Current Load AVG = 169.5

This is a production environment, so I immediately logged on to the server to take a look:

top - 05:01:00 up 5:03, 4 users, load average: 185.10, 200.94, 193.27
Tasks: 3442 total, 172 running, 3269 sleeping, 0 stopped, 1 zombie
Cpu(s): 96.3%us, 2.8%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.9%si, 0.0%st
Mem: 527990848k total, 465240080k used, 62750768k free, 1593680k buffers
Swap: 31457276k total, 0k used, 31457276k free, 51934560k cached

  PID USER   PR NI VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
24225 oracle     0 256g  38m  29m R 68.8  0.0  16:31.12 oracle
34187 oracle     0 256g  37m  26m R 68.2  0.0  25:36.82 oracle
47581 oracle     0 256g  40m  27m R 68.2  0.0 226:48.02 oracle
35919 oracle     0 256g  44m  29m R 66.4  0.0 484:24.28 oracle
37934 oracle     0 256g  42m  29m R 64.8  0.0 460:41.09 oracle
24227 oracle     0 256g  43m  30m R 63.9  0.0 477:54.11 oracle
10826 oracle     0 256g  44m  30m R 63.0  0.0 518:04.92 oracle
64990 oracle     0 256g 148m  30m R 62.7  0.0 960:40.23 oracle
26610 oracle     0 256g  33m  26m R 60.2  0.0   5:10.04 oracle
31533 oracle     0 256g  37m  27m R 59.9  0.0  25:29.96 oracle
56682 oracle     0 256g  41m  30m R 59.3  0.0 443:13.79 oracle
39668 oracle     0 256g  37m  26m R 58.6  0.0  27:16.07 oracle

After watching it refresh several times, CPU usage stayed above 95% and the top 20 processes were all Oracle processes, so the next step was to find out what SQL these processes were running.

Start with one of the SPIDs with high CPU usage and map it back to a database session:

SELECT a.sid, a.serial#, a.username, a.osuser, b.spid, a.status
  FROM v$session a, v$process b
 WHERE a.paddr = b.addr
   AND b.spid = 40514
 ORDER BY a.sid;

       SID    SERIAL# USERNAME      OSUSER   SPID    STATUS
---------- ---------- ------------- -------- ------- --------
      2793        883 msp_all_dml   app      40514   ACTIVE

SQL_TEXT
----------------------------------------------------------------
UPDATE sync_processing_status
   SET status_ind = :1, retry_count = :2,
       error_msg = :3, update_date = :4
 WHERE sync_id = :5 AND process_type = :6
   AND create_date > SYSDATE - 7

I then checked one or two more of the high-CPU SPIDs, and they were executing this same SQL.

Next, look at what the active sessions in the database are doing:
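The original query is not shown in the post; a minimal sketch that would produce roughly the output below, assuming Oracle 10g or later (where v$session exposes SQL_ID and EVENT directly) and approximating RUN_IN_SEC with LAST_CALL_ET:

SELECT s.sid, s.serial#, s.sql_id,
       s.last_call_et AS run_in_sec,   -- seconds the session has been in its current state
       s.osuser AS os_user, s.machine,
       q.sql_text, s.module, s.event
  FROM v$session s, v$sql q
 WHERE s.sql_id = q.sql_id
   AND q.child_number = 0
   AND s.status = 'ACTIVE'
   AND s.type = 'USER'
 ORDER BY s.sid;

This returned output like the following: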

       SID    SERIAL# SQL_ID        RUN_IN_SEC OS_USER  MACHINE
---------- ---------- ------------- ---------- -------- --------------------------
SQL_TEXT
-----------------------------------------------
MODULE                EVENT
--------------------- -----------------
      6826      13187 0vqfhz794y0ys          7 app      Xxxx_appserver
UPDATE sync_processing_status SET status_ind = :1, retry_count = :2, error_msg = :3, update_date = :4 WHERE sync_id = :5 AND process_type = :6 AND create_date > SYSDATE - 7
JDBC Thin Client      latch free

     10990      20293 0vqfhz794y0ys          7 app      Xxxxx_appserver
UPDATE sync_processing_status SET status_ind = :1, retry_count = :2, error_msg = :3, update_date = :4 WHERE sync_id = :5 AND process_type = :6 AND create_date > SYSDATE - 7
JDBC Thin Client      latch free

     10994      18173 0vqfhz794y0ys          7 app      Xxxxx_appserver
UPDATE sync_processing_status SET status_ind = :1, retry_count = :2, error_msg = :3, update_date = :4 WHERE sync_id = :5 AND process_type = :6 AND create_date > SYSDATE - 7
JDBC Thin Client      latch free

      6457      48363 0vqfhz794y0ys          7 app      Xxx_appserver
UPDATE sync_processing_status SET status_ind = :1, retry_count = :2, error_msg = :3, update_date = :4 WHERE sync_id = :5 AND process_type = :6 AND create_date > SYSDATE - 7
JDBC Thin Client      latch free

(There is a lot more output, omitted here.)

Basically everything was executing this SQL: roughly 170 active sessions were running it.
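A quick way to count them (a sketch using the sql_id found above):

SELECT COUNT(*)
  FROM v$session
 WHERE status = 'ACTIVE'
   AND sql_id = '0vqfhz794y0ys';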

Then I went to OEM to look at the execution plan: the statement was doing a full table scan.

And here is the key point: this SQL had two plan hash values. So my guess was that an execution plan change had made the SQL very slow; with so many sessions executing it, hot blocks appeared, producing latch free wait events, and the CPU soared.
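Outside of OEM, one way to pull every plan AWR has captured for the statement (a sketch; requires the Diagnostics Pack license):

SELECT * FROM TABLE(dbms_xplan.display_awr('0vqfhz794y0ys'));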

But that was still a guess; the root cause of the latch contention had not yet been confirmed.

Kept digging.

The plan had previously used an index range scan, but was now doing a full table scan.

And the key finding: the statistics on this table dated back to 2015.
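A quick sketch to confirm how stale the statistics are (the owner filter is omitted here, since the schema name was anonymized in the post):

SELECT owner, table_name, num_rows, last_analyzed
  FROM dba_tab_statistics
 WHERE table_name = 'SYNC_PROCESSING_STATUS';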

Execute:

dbms_stats.gather_table_stats(ownname => 'XX', tabname => 'SYNC_PROCESSING_STATUS', estimate_percent => 5, cascade => TRUE, degree => 8, no_invalidate => FALSE);

After the statistics were gathered, the load returned to normal and CPU usage came down.

==============ended====================================

The problem was handled, but more information can be dug out of it, such as an analysis of the latch free wait event.

This event can be caused by a number of things, listed here:

Shared pool and library cache latch contention, reason one: hard parsing

Shared pool and library cache latch contention is usually caused by hard parsing. A hard parse must allocate a new cursor, or rebuild a cursor that has been aged out before re-execution. Excessive hard parsing indicates that SQL statements are not fully using bind variables. A hard parse is an expensive operation, and the library cache latch must be held for the duration of the parse.
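A sketch for getting a quick sense of how much hard parsing is going on (figures are cumulative since instance startup):

SELECT name, value
  FROM v$sysstat
 WHERE name IN ('parse count (total)', 'parse count (hard)');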

If the system has a large number of literal SQL statements, changing them to fully use bind variables will significantly reduce contention for the shared pool latch and the library cache latch. Changing the SQL statements to use bind variables usually requires modifying the application.
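On 10g and later, a sketch like this can surface the worst literal-SQL offenders (statements that would collapse into one cursor under CURSOR_SHARING=FORCE share a FORCE_MATCHING_SIGNATURE; the threshold of 10 is arbitrary):

SELECT force_matching_signature, COUNT(*) AS stmt_count,
       MIN(sql_text) AS sample_text
  FROM v$sqlarea
 WHERE force_matching_signature <> 0
 GROUP BY force_matching_signature
HAVING COUNT(*) > 10
 ORDER BY stmt_count DESC;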

Shared pool latch contention, reason two: an oversized shared pool

From Oracle9i onward, a large shared pool is no longer necessarily a bad thing, thanks to the introduction of multiple subpools. Before 9i, an oversized shared pool often caused shared pool latch contention. The free memory in the shared pool is divided into chunks of different sizes, and chunks of different sizes are managed on different free lists. Allocating space in the shared pool requires scanning a free list, and the shared pool latch must be held during the scan. An oversized shared pool makes the free lists long, so the shared pool latch is held longer. In a high-concurrency environment, the longer hold times cause latch contention (visible as higher SLEEPS and MISSES values), especially on systems with large amounts of literal SQL. For such systems, do not blindly enlarge the shared pool; it is more important to ask why so many unshareable statements need to be kept in the shared pool at all.
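A sketch for inspecting the chunk makeup of the shared pool (x$ksmsp is only visible when connected as SYS, and this query itself takes the shared pool latch, so run it sparingly on a busy system):

SELECT ksmchcls AS chunk_class, COUNT(*) AS chunks, SUM(ksmchsiz) AS bytes
  FROM x$ksmsp
 GROUP BY ksmchcls;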

Library cache latch contention, reason three: too many statement versions

For SQL statements whose text is character-for-character identical but which cannot be shared because they reference different objects, Oracle uses multiple child cursors to point at the different versions. For example, suppose the system has three tables named customer belonging to different schemas. For the statement SELECT * FROM customer, the text hashes to exactly the same value in every schema, but the statement cannot be shared because it references different objects, so different child versions of the statement are generated. When a statement has many versions, Oracle must compare all the existing child versions, holding the library cache latch the whole time, which can cause library cache latch contention. This situation is easy to avoid: try not to reuse the same object name across schemas.
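A sketch to spot high-version statements (the threshold of 20 is arbitrary):

SELECT sql_id, version_count, executions, sql_text
  FROM v$sqlarea
 WHERE version_count > 20
 ORDER BY version_count DESC;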

Cache buffers chains latch

Cache buffers chains latch contention, reason one: inefficient SQL statements

Inefficient SQL statements are the main cause of cache buffers chains latch contention. In high-concurrency systems, the latch free wait time can become very noticeable. The typical scenario is an application that opens multiple concurrent sessions executing the same inefficient SQL against the same data set.

Note the following three points:

1. Every logical read requires a latch get.

2. A process keeps requesting the latch until it obtains it.

3. At any moment, only one process can hold a given cache buffers chains latch, and that latch may protect many blocks, some of which other processes may be requesting (although, as noted earlier, Oracle9i allows the cache buffers chains latch to be shared for read-only access).

In general, fewer logical reads mean fewer latch gets, which means less latch contention and better system performance. You should therefore find the inefficient SQL statements causing cache buffers chains latch contention, tune them, and minimize their logical reads. Statements with a high BUFFER_GETS/EXECUTIONS ratio are probably the ones you need to tune.
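A sketch that lists the top candidates by that ratio:

SELECT *
  FROM (SELECT sql_id, executions, buffer_gets,
               ROUND(buffer_gets / executions) AS gets_per_exec
          FROM v$sqlarea
         WHERE executions > 0
         ORDER BY gets_per_exec DESC)
 WHERE rownum <= 20;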

Cache buffers chains latch contention, reason two: hot blocks

Hot blocks are the other major cause of cache buffers chains latch contention. The problem arises when multiple processes repeatedly access one or more blocks protected by the same cache buffers chains latch, and it is usually an application issue. In this case, increasing the number of cache buffers chains latches does nothing for contention caused by hot blocks, because which hash bucket and hash chain a block lands on is determined by the data block address (DBA) and the number of hash buckets, not by the number of latches. As long as the block address and the number of hash buckets stay the same, the hot blocks stay in the same hash buckets and chains, protected by the same latches, and those latches remain contended. The exception is a dramatic increase in the number of latches, so that each latch manages very few hash buckets (perhaps even one bucket per latch); then the original hot blocks may end up managed by different latches and no longer contend for the same one.

The best way to attack this kind of cache buffers chains latch contention is to find the hot blocks. The P1RAW parameter of the latch free wait event tells you whether a hot block is causing the contention. (In Oracle 10g, the wait event for this latch is no longer "latch free" but "latch: cache buffers chains".) P1RAW is the address of the latch: if multiple sessions are waiting on the same latch address, congratulations, you have a hot block problem.
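A classic two-step sketch (x$bh is only visible when connected as SYS, and &latch_addr stands for the address found in the first step):

-- Step 1: which child latch address are waiters piling up on?
SELECT p1raw, COUNT(*) AS waiters
  FROM v$session_wait
 WHERE event LIKE 'latch%'
 GROUP BY p1raw
 ORDER BY waiters DESC;

-- Step 2: the hottest buffers protected by that child latch
SELECT dbarfil AS file#, dbablk AS block#, tch AS touch_count
  FROM x$bh
 WHERE hladdr = '&latch_addr'
 ORDER BY tch DESC;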

Cache buffers chains latch contention, reason three: hash chains that are too long

Multiple data blocks may be assigned to the same hash bucket, where they form a linked list (the hash chain). In a large system, a single hash bucket may hold hundreds of blocks. Searching a hash chain for a block requires holding the cache buffers chains latch, which serializes access. If the chain is too long, the latch hold time grows accordingly, and other processes' requests for that cache buffers chains latch may fail.
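A sketch to gauge how many buffers each child latch protects (again SYS-only; unusually long chains show up as outliers at the top):

SELECT hladdr, COUNT(*) AS buffers_protected
  FROM x$bh
 GROUP BY hladdr
 ORDER BY buffers_protected DESC;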

-----------------------------------------------------------------------

If you want to analyze further, the following SQL shows the details of a specific latch.

For the latch free wait event in v$session, P2 gives the latch type (LATCH#); use it to look up the latch:

SELECT latch#, name FROM v$latch WHERE latch# = 98;

SELECT * FROM v$latchholder;

SQL> SELECT latch#, name, gets, misses, sleeps FROM v$latch WHERE name LIKE 'cache buffer%';

This query returns the cumulative figures for the cache buffers chains latches since the instance started: GETS is the number of requests, MISSES the number of requests that did not succeed on the first attempt, and SLEEPS the number of times a failed request had to sleep. SLEEPS gives a rough sense of how severe latch contention is in the database, which in turn indirectly indicates how serious any hot block problem is. Since v$latch holds aggregated information, it cannot tell us which blocks are being accessed most frequently; for that we turn to another view, v$latch_children, whose ADDR column records the address of each child latch.

SQL> SELECT addr, latch#, child#, gets, misses, sleeps FROM v$latch_children WHERE name = 'cache buffers chains' AND rownum < 21;

The main views for analyzing latch free wait events are v$latch, v$latchholder, v$latch_children, v$latchname and v$latch_parent. When there is time later, I will analyze the usage of these views in detail, along with how to track down hot blocks starting from the latch.

============================

Of course, there are many possible causes for this kind of execution plan flip. This case came down to stale statistics; there are plenty of others, such as the fairly common bind variable peeking (combined with a sudden change in data volume), which I will analyze in detail when I run into them.

A later check showed that subsequent executions of the SQL were back on the index range scan, i.e. the plan with plan hash value 1107144300.

SQL> SELECT DISTINCT sql_id, plan_hash_value
  2    FROM dba_hist_sqlstat q
  3   WHERE snap_id BETWEEN 130811 AND 130812
  4     AND q.sql_id = '0vqfhz794y0ys';

SQL_ID        PLAN_HASH_VALUE
------------- ---------------
0vqfhz794y0ys      3910147552
0vqfhz794y0ys      1107144300

SELECT sql_id, first_load_time, executions, loads, parse_calls FROM v$sqlarea WHERE sql_id = '0vqfhz794y0ys';

SELECT sql_id, plan_hash_value, child_number, executions, parse_calls, buffer_gets, is_bind_sensitive, is_bind_aware FROM v$sql WHERE sql_id = '0vqfhz794y0ys';

==========================ended===================================
