Five most common database and/or instance performance problems in the RAC environment (document ID1602

Last Update:2018-06-04 Source: Internet

Author: User

Tags idn high cpu usage

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Scope of Application of the document content details problem 1: a large number of block loss (gclostblocks, gccurrentcrlostblocks) problem 2: a large number of logfilesync waiting problem 3: long wait problem on Mutexes 4: High enq: TX-rowlockcontention, enq: TX

Scope of Application of the document content details problem 1: mass loss (gc lost blocks, gc current/cr lost blocks) problem 2: a large number of log file sync waits problem 3: long wait problem on Mutexes 4: High enq: TX-row lock contention, enq: TX-index contention, enq: TX-IT

Document Content

Purpose

Applicability

Details

Problem 1: massive block loss (gc lost blocks, gc current/cr lost blocks)

Question 2: a large number of log file sync waits

Question 3: long wait on Mutexes

Question 4: High enq: TX-row lock contention, enq: TX-index contention, enq: TX-ITL allocate entry waiting

Question 5: high CPU and memory overhead

Reference

Applicable to: Oracle Database-Enterprise Edition-versions 10.1.0.2 to 11.2.0.3 [Releases 10.1 to 11.2]
The information contained in this document applies to all platforms
Purpose

This article aims to provide a summary of the most common database and/or instance performance problems in the RAC environment and possible solutions

Please note that question 3 (massive wait on Mutexes) and question 5 (high CPU and memory overhead) are general database problems and are not specific to RAC issues. However, these problems are discussed in this article because these are one of the most common problems that will happen to an instance in the RAC environment.

Applicability

DBAs

Problem 1: massive loss of blocks (gc lost blocks, gc current/cr lost blocks)

Symptom

I. AWR reports show a large amount of loss.

II. netstat-s reports that data packets reassemble failed (reassambly failure) and lost data packets (dropped packets) increase.

Solution

Use the following documents to troubleshoot and solve the block loss problem. This document describes the symptoms, possible causes, and solutions.

Document 563566.1-gc block lost diagnostics

Question 2: a large number of log file sync waits

Symptom

The I. AWR report shows that log file sync is always in the Top 5 waiting event list.

II. The average log file sync time is long (> 20 ms ).

III. The average log file parallel write time is long (> 10 ms ).

III. The average redo write broadcast ack time or wait for scn ack time is long (> 10 ms ).

IV. The average log file sync time is short, but the log file sync waits too many times.

Background

When a user session is submitted or rolled back, the session redo information needs to be refreshed from LGWR to the redo log file. The user session waits for "log file sync" and waits for LGWR to notify you to confirm that all redo changes are stored securely on the disk.

Example:

WAIT #0: nam = 'Log file sync' ela = 977744 buffer # = 754 p2 = 0 p3 = 0 obj # = 114927 tim = 1212384670107387

Parameters:

P1 = buffer #
P2 = Unused
P3 = Unused
Obj # = object_id

All changes to this buffer (in the redo log buffer) must be refreshed to the disk and written to ensure that the transaction processing has been committed and remains committed until the instance is closed.

Typical lifecycle of "log file sync" waiting

1. Submit or roll back a user session, and then wait for log file sync.
2. LGWR collects the redo information of the redo log file to be written, releases IO, queues the BOC to the LMS process, and notifies the LMS process.
3. LGWR is waiting to refresh the redo information to the disk and wait for the SCN to be confirmed and received
4. When I/O is completed and SCN confirmation information is received from LMS, the write is completed. LGWR then notifies the foreground process to continue.
5. the foreground process is awakened and log file sync is waiting for the end.

Important Statistics and events related to log file sync

Redo write time-the total time used to write data from the redo log buffer to the current redo log file, in 10 milliseconds.
Total number of times redo writes-LGWR writes to the redo log file. In addition to the number of write redo blocks, the statistical information is equal to the number of blocks written each time.
Log file parallel write-when I/O is completed. Although redo records are written in parallel, the parallel write will not be completed until the last I/O is written to the disk.
Redo write broadcast ack time-the total length of broadcast-related wait time, in milliseconds, in addition to the log write wait time during submission.
User commits-number of user submissions. When a user commits a transaction, the generated redo that reflects the changes to the database block must be written to the disk. The commit operation usually refers to the content closest to the user transaction rate.
User rollbacks-the number of ROLLBACK statements manually published by the user or the number of errors handled by the user transaction.
The Script provided in Document 1064487.1-Script to Collect Log File Sync Diagnostic Information (lfsdiag. SQL) can be used to Collect useful Information for diagnosing log file sync problems.

Possible causes

I. The IO service time and throughput of the device where the redo log file is located are not ideal.

II. Oracle Bug. For known Oracle bugs, see WAITEVENT: "log file sync" Reference (Document 34592.1 ).

III. LMS is not running in the RT (Real-Time) class.

IV. LGWR process scheduling latency.

V. Too many submissions.

VI. Insufficient operating system resources.

Solution

I. If the parallel writing of log files is always high (> 10 ms), move the redo log files to disks with high throughput and shorter response time. The ideal range of parallel Log File writing should be 1-2 milliseconds.

II. Known Oracle Bug fixes for applications in your environment. The most effective way to obtain these fixes is to apply the latest PSU patch. Document 756671.1 provides detailed information about the latest PSU.

III. Make sure that the LMS process runs in the RT class. By default, the LMS process runs in the RT class.

IV. You can use batch submission instead of after each DML operation to reduce the number of submissions.

V. check whether any activity can be safely completed using the NOLOGGING/UNRECOVERABLE option.

VI. Check whether the commit nowait option can be used. For more information, see Document 857576.1.

@ For Support Only: Renice LGWR to run at higher priority or run LGWR in RT class by adding LGWR to the parameter: _ high_priority_processes = 'vktm | LMS | LGWR ". consider doing this only if log file sync is high and scheduling delay of LGWR is found to be causing it. be prepared to test it thoroughly.

Question 3: long wait on Mutexes

Mutexes is a more lightweight and fine-grained parallel mechanism than latches. Mutex is obtained to ensure correct management of parallel execution of specific operations. For example, if a session is changing the data structure in the memory, other sessions wait until Mutex is obtained, to make similar changes. Below are the most common Mutex-related WAITS:

A. cursor: pin S wait on X
B. cursor: pin S
C. library cache: Mutex X

Symptom ()

The AWR report displays "cursor: pin S wait on X as one of the top wait ".

Background ()

A session waits for this event when it tries to obtain the Mutex lock in shared mode, other sessions hold the Mutex lock exclusively on the same cursor object. Generally, waiting for "Cursor: pin S wait on X" is a symptom rather than a cause. Underlying optimization or known issues may be required.

Possible cause ()

Please refer to Troubleshooting 'cursor: pin S wait on x'waits (Document 1349387.1 ).

Solution ()

Please refer to Troubleshooting 'cursor: pin S wait on x'waits (Document 1349387.1 ).

Symptom (B)

The AWR report displays "cursor: pin S as one of the top waits"

Background (B)

When a session applies for a specific Mutex on a specific cursor in shared mode, although there is no parallel exclusive holder, it cannot get Mutex immediately, and then it will wait for "cursor: pin S ". This seems unreasonable because the user may not understand why such a wait exists without the exclusive mode owner. The reason for this wait is that in order to obtain or release Mutex in shared mode, the session must increase or decrease the Mutex reference count, this requires an exclusive atomic update for the Mutex structure itself. If a parallel session attempts to update Mutex, only one session can actually increase or decrease the reference count at a time. If a session cannot perform such atomic update immediately due to other parallel requests, the system waits for "cursor: pin S ".
In the RAC environment, Mutex only acts on the local instance.

Parameters:
P1 = idn
P2 = Value
P3 = where (where in 10.2 | sleeps)

Idn is the Mutex Identifier value, which matches the HASH_VALUE of the SQL statement waiting for obtaining Mutex. You can find the SQL statement using the corresponding IDN in the following format:

SELECT SQL _id, SQL _text, version_count
From v $ SQLAREA where HASH_VALUE = & IDN;

If the SQL _TEXT format displayed by the cursor is "table_x_x_x_x", this is a special internal cursor. For details about ing such a cursor to an object, see Document 1298471.1.

P1RAW uses the same hexadecimal value and can be used to search the SQL statement that matches the hash value in the trace file.

Possible cause (B)

I. A specific Mutex has a large number of parallel operations, especially on multiple CPU Systems.

II. In the case of high load, the system will wait for many different "idn" values.

III. Oracle Bugs
Bug 10270888-ORA-600 [kgxEndExamine-Bad-State]/mutex waits after a self deadlock
Bug 9591812-Wrong wait events in 11.2 ("cursor: mutex S" instead of "cursor: mutex X ")
Bug 9499302-Improve concurrent mutex request handling
Bug 7441165-Prevent preemption while holding a mutex (fix only works on Solaris)
Bug 8575528-Missing entries in V $ MUTEX_SLEEP.location

Solution (B)

I. Fixed the Bug 10411618.

II. For any identified "Hotspot" SQL, you can reduce parallel operations on a specific cursor by replacing the SQL with some variants executed by other sessions. For more information, see WAITEVENT: cursor: pin S Reference (Document 1310764.1 ).

III. Fix other known Oracle bugs. The most effective way to obtain a fix is to apply the latest PSU patch ). Document 756671.1 provides detailed information about recommended patches.

Symptom (C)

The AWR report displays "library cache: Mutex X as one of the TOP waits ".

Background (C)

In earlier Oracle versions, obtaining library cache Mutex is similar to obtaining library cache latches. In 10 Gb, Mutex is introduced for specific operations in the library cache. Starting from 11 GB, Mutex replaced library cache latches. This wait event occurs as long as a session holds the library cache mutex in exclusive mode and other sessions need to wait to release the Mutex.

Waiting separately:
Parameters:
P1 = "idn" = unique Mutex identifier
P2 = "value" = session ID holding Mutex
P3 = "where" = location (internal identifier) in the code waiting for Mutex)

System wait scope:
At the system scope level, two views are available to help diagnose this wait:

GV $ MUTEX_SLEEP (for non-rac v $ MUTEX_SLEEPS)
And GV $ MUTEX_SLEEP_HISTORY (for non-rac v $ MUTEX_SLEEP_HISTORY)

After the instance is started, these views track the usage of Mutex within the instance range. Because these views show the total number of views since startup, you can get the differences between values in a short interval when a problem occurs, so the data is very meaningful. The easiest way to understand this information is to view the "Mutex Sleep Summary" section of the AWR or statspack report.

Possible cause (C)

I. Frequent hard parsing.

II. Number of higher versions.

III. Failure and reload.

IV. Oracle Bug. For more information, see WAITEVENT: "library cache: mutex X" (Document 727400.1) to obtain a list of known Oracle bugs.

Solution (C)

I. Reduce hard parsing, invalidation, and reload. For more information, see Troubleshooting: Tuning the Shared Pool and Tuning Library Cache Latch Contention (Document 62143.1 ).

II. View Troubleshooting: High Version Count Issues (Document 296377.1) to solve the High Version Number Problem.

III. Fix known Oracle bugs in applications. The most effective way to obtain a fix is to apply the latest PSU patch ). Document 756671.1 provides detailed information about the Recommended patch.

Question 4: High enq: TX-row lock contention, enq: TX-index contention, enq: TX-ITL allocate entry waiting

A. enq: TX-Index Contention
B. enq: TX-ITL allocate entry waits
C. enq: TX-row lock contention

Symptom ()

I. AWR reports show that there is a lot of waiting on "enq: TX-index contention and enq: TX-row lock contention" in Mode 4.

II. AWR reports show a large number of branch node splits, leaf node splits and leaf node 90-10 splits

III. AWR reports (Segments by Row Lock Waits) show a large number of Row Lock Waits on a specific segment.

IV. Other waits are displayed in the AWR report, such as gc buffer busy waits on the index branch or leaf block, gc current block busy on the remote rollback segment, and gc current split.

Example:

Top 5 Timed Foreground Events

Event Waits Time (s) Avg wait (MS) % DB time Wait Class
Enq: TX-index contention 29,870 1,238 41 9.52 Concurrency

Instance Activity Stats:

Statistic Total per Second per Trans
Branch node splits 945 0.26 0.00
Leaf node 90-10 splits 1,670 0.46 0.00
Leaf node splits 35,603 9.85 0.05

Segments by Row Lock Waits:

Owner Tablespace Object Name Obj. Type Row Lock Waits % of Capture
ACSSPROD ACSS_IDX03 ACSS_ORDER_HEADER_PK INDEX 3,425 43.62
ACSSPROD ACSS_IDX03 ACSS_ORDER_HEADER_ST INDEX 883 11.25
ACSSPROD ACSS_IDX03 ACSS_ORDER_HEADER_DT INDEX 682 8.69

Background ()

If the row inserted by the transaction in the index needs to wait for the index block split being operated by other transactions to end, the session will wait for the event "enq: TX-index contention ".

Possible cause ()

I. A large number of parallel inserts lead to excessive index block splitting.

II. The index value is continuously increasing to the right for the index generated by the sequence.

Solution ()

The solution is to re-organize indexes by avoiding contention or hot spots. Options include

I. re-create an index as a reverse Index

II. Hash (Hash) Partition Index

III. If the index keyword is generated from sequence, the cache size of the sequence is increased. When the application supports it, the "unordered" sequence is used.

Symptom (B)

The AWR report shows that there is a lot of waiting on "enq: TX-allocate ITL" and "enq: TX-row lock contention" in Mode 4.

Background (B)

The session will wait for "enq: TX-allocate ITL ". Generally, the Oracle database dynamically adds another ITL slot. If the block does not have enough free space to add ITL, this may not be possible.

Possible cause (B)

Compared with the number of parallel transactions, the number of initrans is set too low.

Solution (B)

Locate a segment with a high ITL wait from the AWR report, or use the following SQL:

Select owner, OBJECT_NAME, OBJECT_TYPE
From v $ SEGMENT_STATISTICS
WHERE STATISTIC_NAME = 'itl waits 'and value> 0
Order by value;

Add the initrans value of the segment with a high ITL wait.

Symptom (C)

The AWR report shows that there is a lot of waiting on "enq: TX-row lock contention" in the exclusive mode (6.

Background (C)

The session waits for the row-Level lock held by other sessions and waits for "enq: TX-row lock contention ". This situation occurs when a user updates or deletes a row but has not submitted or rolled back, and another session wants to update or delete the same row.

Solution (C)

This is an application issue and requires application developers to participate in the search for the involved SQL statements. The following documents may be helpful for further research on this issue:

Document 102925.1-Tracing sessions: waiting on an enqueue
Document 179582.1-How to Find TX Enqueue Contention in RAC or OPS
Document 1020008.6-SCRIPT: FULLY DECODED LOCKING
Document 62354.1-TX Transaction locks-Example wait scenarios
Document 224305.1-Autonomous Transaction can cause locking

Question 5: high CPU and memory overhead

A. High CPU usage
B. High memory usage

Symptom ()

I. TOP, prstat, vmstat, and other operating system tools show that the CPU usage of users is very high, and the process that occupies the highest cpu is the Oracle process or background process.

II. The AWR report shows the maximum number of waits for one or more of the following:
Latch free
Cursor pin S wait on X or cursor pin S wait or library cache mutex X
Latch: cache buffers chains
Resmgr: cpu quantum
Enq: RO-fast object reuse
DFS lock handle

III. AWR report (SQLs ordered by buffer gets) shows that some SQL statements read a lot of buffer at each execution and consume a lot of cpu time.

IV. The stack information of a High-cpu overhead process shows that the process is continuously spinning ).

Possible cause ()

I. Oracle bugs:
Bug 12431716-Mutex waits may cause higher CPU usage in 11.2.0.2.2 PSU/GI PSU
Bug 8199533-NUMA enabled by default can cause high CPU
Bug 9226905-Excessive CPU usage/OERI: kkfdPaPrm from Parallel Query/High Version
Count on PX_MISMATCH
Bug 7385253-Slow Truncate/DBWR uses high CPU/CKPT blocks on RO enqueue
Bug 10326338-High "resmgr: cpu quantum" but CPU has idle time
Bug 6455161-Higher CPU/Higher "cache buffer chains" latch gets/Higher "consistent gets"
After truncate/Rebuild

II. Very Large SGA is configured on the Linux x86-64 platform, but Hugepages is not implemented.

III. high-availability SQL statements use unoptimized execution plans.

IV. The process cannot be found.

Solution ()

I. Fix the Bug. The most effective way to get all these fixes is to apply the latest PSU patch ). Document 756671.1 provides detailed information about the latest PSU.

II. Implement hugepages. For more information, see Document 361670.1-Slow Performance with High CPU Usage on 64-bit Linux with Large SGA.

III. Optimize the SQL statements that cause excessive buffer gets and high cpu consumption. For more information, see Document 404991.1-How to Tune Queries with High CPU Usage.

Symptom (B)

I. ps, pmap and other operating system tools show that the process in the memory is growing. Pmap displays the heap and/or stack memory of a process. For example,

# Pmap-x 26677

Address Kbytes RSS Anon Locked Mode Mapped File
00010000 496 480--r-x -- bash
0009A000 80 80 24-rwx -- bash
000AE000 160 160 40-rwx -- [heap]
FF100000 688 688--r-x -- libc. so.1

II. The session uga memory and/or session pga memory of the Oracle process are growing.

Select se. sid, n. name, max (se. value) maxmem
From v $ sesstat se,
V $ statname n
Where n. statistic # = se. statistic #
And n. name in ('session pga memory ', 'session pga memory max ',
'Session uga memory ', 'session uga memory max ')
Group by n. name, se. sid
Order by 3;

III. The number of session/process cursors keeps increasing.

Possible cause (B)

I. Oracle bugs:
Bug 9919654-High resource/memory use optimizing SQL with UNION/set functions
Many branches
Bug 10042937 high memory group in GES_CACHE_RESS AND ORA-4031 ERRORS
Bug 7429070 big pga mem alloc during parse time-KXS-HEAP-C
Bug 8031768-ORA-04031 shared pool "kkj jobq wor"
Bug 10220046-OCI memory leak using non-blocking connection for fetches
Bug 6356566-Memory leak/high CPU selecting from V $ SQL _PLAN
Bug 7199645-Long parse time/high memory use from query rewrite
Bug 10636231-High version count for INSERT... RETURNING statements
Reason INST_DRTLD_MISMATCH
Bug 9412660-PLSQL cursor leak/ORA-600 [kglLockOwnersListDelete]
Bug 6051972-Memory leak if connection to database is frequently opened and closed
Bug 4690571-Memory leak using bulk collect within cursor opened using native
Dynamic SQL

II. The cursor is exposed because the application does not explicitly close the cursor.

III. SQL statements with exceptional hash (hash) join and/or sort operations.

Possible Solution (B)

I. Bug fixes for applications. The most effective way to obtain these fixes is to apply the latest PSU patch ). Document 756671.1 provides detailed information about the latest PSU.

II. Make sure that the cursor is explicitly disabled by the application.

III. Avoid very large hash (hash) join and/or sort operations.

ReferencesNOTE: 756671.1-Oracle Recommended Patches -- Oracle Database
NOTE: 786507.1-How to Determine the Blocking Session for Event: 'cursor: pin S wait on x'
NOTE: 873243.1-Troubleshooting 'enq: TX-index contention 'Waits in a RAC Environment.
BUG: 4690571-MEMORY LEAK USING BULK COLLECT WITHIN CURSOR OPENED USING NATIVE DYNAMIC SQL
BUG: 6051972-MEMORY LEAK IF CONNECTION TO DATABASE SERVER IS FREQUENTLY OPENED AND CLOSED
NOTE: 62354.1-Waits for 'enq: Tx-Row Lock Contention '-Wait Scenario Examples
NOTE: 857576.1-How to Minimise Waits for 'Log File sync'
Bugs: 10411618-ADD DIFFERENT WAIT SCHEMES FOR MUTEX WAITS

BUG: 10636231-high version count for insert statements with reason INST_DRTLD_MISMATCH
BUG: 6356566-STATSPACK. snap cosumes almost 100% CPU
BUG: 6455161-HIGHER "consistent gets" AFTER TRUNCATE

BUG: 9591812-incorrect wait events in 11.2 ("CURSOR: mutex s" instead of "CURSOR: mutex x ")

NOTE: 1511700.1-ORA-00060 Single-Resource Deadlock Occurs on Autonomous Transaction
NOTE: 1298015.1-WAITEVENT: "cursor: pin S wait on X" Reference Note
NOTE: 1298471.1-H $ pseudo _ cursor
NOTE: 1310764.1-WAITEVENT: "cursor: pin S" Reference Note
NOTE: 1349387.1-Troubleshooting 'cursor: pin S wait on x' waits.
NOTE: 1020008.6-SCRIPT: FULLY DECODED LOCKING
NOTE: 102925.1-Tracing sessions: waiting on an enqueue
NOTE: 1064487.1-Script to Collect Log File Sync Diagnostic Information (lfsdiag. SQL)
NOTE: 1356828.1-FAQ: 'cursor: mutex... '/'cursor: pin...'/'library cache: mutex .. 'Type Wait Events
NOTE: 1357946.1-Troubleshooting 'library cache: mutex X 'waits.
NOTE: 164768.1-Troubleshooting: High CPU Utilization
NOTE: 179582.1-How to Find TX Enqueue Contention in RAC or OPS
NOTE: 361670.1-Slow Performance with High CPU Usage on 64-bit Linux with Large SGA
NOTE: 404991.1-How to Tune Queries with High CPU Usage
NOTE: 563566.1-Troubleshooting gc block lost and Poor Network Performance in a RAC Environment
NOTE: 34592.1-WAITEVENT: "log file sync" Reference Note
NOTE: 62143.1-Troubleshooting: Tuning the Shared Pool and Tuning Library Cache Latch Contention
NOTE: 727400.1-WAITEVENT: "library cache: mutex X"

BUG: 7429070-big pga mem alloc during parse time-KXS-HEAP-C

BUG: 8031768-ORA-04031 shared pool "kkj jobq wor"

Bugs: 10042937-high memory group in GES_CACHE_RESS AND ORA-4031 ERRORS
BUG: 10220046-memory leak occurs with NON-BLOCKING app in the later version 10.2.0.4

BUG: 7199645-QUERY_REWRITE_ENABLED TAKE LONG PARSE TIME AND CONSUME ALL SERVER MEMORY
BUG: 7385253-DBWR IS CONSUMING HIGH CPU

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More