Objective
Before we knew it, the "Technology Life: Me and the Data Center" series has reached its fourth installment. Xiao Y is back to meet you again!
What do you do when you see a business system producing a wave-shaped TPS curve like the one below during stress testing?
[Figure: wave-shaped TPS curve observed during the stress test]
What Xiao Y (of ZhongYi Tech) shares today is the analysis and resolution of exactly such a performance problem. It had plagued the customer for quite some time; fortunately, Xiao Y pinpointed the cause within 10 minutes of remote analysis and helped the customer finally resolve it. It is worth noting that in this case adjusting database parameters alone was not enough; other analysis and adjustments were needed as well.
To stay faithful to events while keeping the article interesting, Xiao Y will stick to the narrative style of the earlier installments and try to present the mental process of working the problem, so that friends can see Xiao Y's real working state.
What you will learn
An important performance-related parameter that must be tuned on Oracle 11.2.0.3 and above!
How to persist, or quickly change the diagnostic direction, after a first diagnosis fails!
How to keep the initiative when handling issues that span teams and departments!
A friendly tip
If your high-concurrency, transactional OLTP core business system often shows performance jitter, such as sudden sharp rises in transaction response time along with rises in AP server port/process activity, JDBC connection counts, and database DB time per second, and your Oracle version is 11.2.0.3 or above, it is very likely related to the important parameter covered in this article! If adjusting that parameter does not resolve it, you can contact Xiao Y for a diagnosis (free of charge).
Part 1
Here's the problem.
At 10 o'clock in the morning, QQ suddenly flashed. Work had arrived!
"Xiao Y, are you free? Help me look at an AWR report."
"I'll fill you in by phone in a moment."
The message came from the DBA of a large domestic airline. He can usually solve problems on his own; this time it looked like he had hit real trouble. During pre-launch stress testing of their new key business system this year, the application could not reach the required concurrency and response-time targets. The TPS curve during the stress test looked like this:
[Figure: wave-shaped TPS curve measured during the stress test]
As you can see, TPS during the stress test is wave-shaped and extremely unstable.
The customer had already done a lot of analysis. At the resource level, CPU, memory, and IO were all normal, but they had also noticed that during testing the Oracle database showed a large number of abnormal waits, mainly of the GC type, so they suspected the private-interconnect switch. Unfortunately, the network team found no anomalies.
They had also asked their incumbent Oracle service provider to analyze the issue, but the problem remained unresolved. Meanwhile the business system's go-live date kept drawing closer, and the pressure on the customer kept growing!
Xiao Y had been following this customer for a long time and sincerely hoped for the chance to serve them. Now that the opportunity had come, Xiao Y was naturally in high spirits and ready for battle.
Environment introduction:
Operating system: RedHat Linux (64-bit), 64 CPU cores, 128 GB memory
Database: Oracle 11.2.0.3, two-node RAC
Part 2
Analysis process
2.1 Analyzing the database's DB time per second
[Figure: AWR report header showing DB time and elapsed time]
Dividing DB time by elapsed time shows that DB time per second reaches 75! That is extremely abnormal: it means the database is experiencing severe waits, and the top wait events must be reviewed for further analysis.
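The arithmetic behind this metric can be sketched as follows; the DB time figure here is a hypothetical value chosen to reproduce the ratio reported above (75 average active sessions):

```python
# DB time per second (average active sessions) from an AWR report:
# total DB time divided by wall-clock elapsed time, in the same units.
def db_time_per_second(db_time_min: float, elapsed_min: float) -> float:
    return db_time_min / elapsed_min

# Hypothetical one-hour snapshot consistent with the article's ratio:
# 4,500 minutes of DB time over a 60-minute window means 75 sessions
# were active on average, far beyond what even 64 cores can service
# without long queues forming somewhere.
print(db_time_per_second(4500.0, 60.0))  # 75.0
```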
2.2 Where is transaction time spent? (Top 5 wait events)
1) Node 1's wait events are as follows:
[Figure: node 1 top 5 timed events]
Event analysis
Among the Oracle top 5 waits, gc buffer busy acquire ranks first at 51.2% of DB time, with an astonishing average wait of 277 milliseconds! gc buffer busy acquire means that before process A's request, another process B on the same instance had already requested the same data block from node 2, and that transfer has not yet completed, so process A must wait.
In second place is log file sync at 18.58%, averaging 293 milliseconds per wait. log file sync means that when a process commits, it must wait for the LGWR background process to persist the change vectors in the log buffer to the redo logs on disk.
In third place is DB CPU. Over the one-hour sample, total DB CPU time is 24,648 seconds, roughly 7 CPUs' worth; with 64 CPUs configured on the server, average utilization is only about 10%. A side note from Xiao Y: generally, the higher the DB CPU ratio the better, since it means SQL spends little time waiting during execution and responds faster. But a high DB CPU does not mean there is no problem: SQL with high logical reads, when the data it manipulates is all in memory, also drives DB CPU up, and then the high-logical-read SQL must be optimized to bring DB CPU back down.
Fourth is direct path read, averaging 153 milliseconds per wait. direct path read means a process reads data blocks directly into its own PGA memory rather than into the shared buffer cache. In that case IO throughput is obviously larger, since every process reads its own copy, even of the same data. If instead different processes reading the same data read it into shared memory, only one process performs the physical read while the others operate on the data in memory, and IO throughput is much smaller.
Fifth is buffer busy wait, with an amazing average of 499 milliseconds. buffer busy wait means two or more processes need to perform write/write or write/read operations on the same data block at the same time: a hot-block conflict.
Reading this far, Xiao Y already basically knew the answer!
Still, to be rigorous, node 2's wait events deserve a quick look as well.
2) Node 2's wait events are as follows:
[Figure: node 2 top 5 timed events]
Compared with node 1, there is no buffer busy wait, but more gc current block busy.
Overall, the wait events on the two nodes do not differ much!
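AWR's per-event columns are simple ratios, and recomputing them is a useful sanity check. The wait counts and DB time totals below are hypothetical values chosen to reproduce the 277 ms average and 51.2% share reported above for gc buffer busy acquire:

```python
# Recompute AWR's "Avg wait (ms)" and "% DB time" columns from raw totals.
def avg_wait_ms(total_wait_s: float, waits: int) -> float:
    return total_wait_s / waits * 1000.0

def pct_db_time(total_wait_s: float, db_time_s: float) -> float:
    return total_wait_s / db_time_s * 100.0

# Hypothetical totals: 499,061 waits costing 138,240 s in total, against
# 270,000 s of DB time (75 average active sessions over a one-hour snapshot).
print(round(avg_wait_ms(138240, 499061)))     # 277 ms per wait
print(round(pct_db_time(138240, 270000), 1))  # 51.2 % of DB time
```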
2.3 The first two minutes: Xiao Y's brainstorm
2.3.1
Is it a RAC private-interconnect problem?
Reading this far, someone may ask:
With GC waits that high, why not temporarily shut down the other RAC node? Wouldn't that solve the problem?
First, the answer is no! Second, while that can serve as a stopgap during a GC performance emergency in production, for a case like this one it was obviously unacceptable to the customer.
Xiao Y answers the question at the technical level.
First, as the figure shows,
the private interconnect between the two nodes carries only about 3 MB per second, while the two RAC servers' interconnect runs through gigabit switches.
[Figure: private interconnect traffic of roughly 3 MB/s]
Second, CPU and memory usage on both RAC nodes is low, so there is no resource shortage making one node run slowly and thus unable to answer the other node's GC requests quickly.
If that were the case, we would usually see waits of the gc*congested* type (congestion).
2.3.2
Is SQL inefficiency causing the GC/BBW/direct path read waits?
Here BBW stands for buffer busy wait,
and GC for the gc buffer busy acquire waits.
Someone may say:
With GC waits that high, plus buffer busy waits, if the SQL were efficient enough then fewer data blocks would be accessed, so processes would issue fewer GC requests, and the hot-block conflicts caused by concurrent reads and writes would disappear.
The answer is again no!
As the figure shows, the application is actually written rather well: most SQL is kept under 100 logical reads, and only 3 statements run from thousands to tens of thousands of logical reads. SQL of that efficiency and logical-read volume is not enough to cause such high GC/BBW waits! Moreover, blaming SQL efficiency cannot explain why log file sync and direct path read also show long average single waits. A wrong direction never solves the root problem: even after spending more time optimizing the few higher-logical-read statements, the stress-test problem would remain. For this case, SQL optimization is icing on the cake, not the timely help that was needed!
[Figure: SQL ordered by logical reads, mostly under 100 gets per execution]
2.3.3
Is direct path read saturating the IO bandwidth?
Some may raise this possibility:
First, direct path read fills up the IO bandwidth.
Explanation: multiple processes read data blocks into their private PGA memory instead of the shared buffer cache. At 16 blocks per multiblock read and 8 KB per block, each process can read about 30 MB/s, so fifteen or more processes doing concurrent multiblock reads can saturate the HBA card's bandwidth. (Event 10949 can be set to suppress this behavior.)
Second, because the IO bandwidth is full, LGWR's log-write response time suffers, which in turn lengthens the log file sync waits.
And since log file sync is itself a component of the GC and buffer busy waits, it pulls those waits up as well, producing the wait profile in the AWR report?
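The back-of-envelope numbers in this hypothesis check out; here is the arithmetic, where the reads-per-second figure and the 4 Gb/s HBA are illustrative assumptions, not values from the case:

```python
# Throughput estimate for direct path (multiblock) reads.
BLOCK_KB = 8   # database block size
MBRC = 16      # blocks per multiblock read, as assumed in the text

def per_process_mb_s(reads_per_s: float) -> float:
    """MB/s generated by one process doing multiblock reads."""
    return reads_per_s * MBRC * BLOCK_KB / 1024.0

one = per_process_mb_s(240)  # ~240 reads/s per process
print(one)                   # 30.0 MB/s, matching the article's estimate
print(15 * one)              # 450.0 MB/s: enough to saturate a ~4 Gb/s HBA
```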
First of all, any friend who can form this hypothesis is welcome to send Xiao Y a résumé: it shows a very deep understanding of the database, very rich troubleshooting experience, and a way of analyzing problems that is not fragmented!
You are welcome to join the technical DBA team! Here, brothers fight side by side, take on all kinds of problems together, and share the rewards together!
Please send your CV to [email protected]
So what does log file sync have to do with GC?
Referring to a diagram from a RAC SG, the principle is as shown:
[Figure: GC block transfer flow; step 2 is the holder node's LGWR log flush]
From the diagram you can see:
a GC-class request includes, in its second step, the LGWR process writing the log.
That is, log file sync is a sub-step of a GC request; strictly speaking, that sub-step is called gc log flush sync.
But the answer is still no!
As the load profile shows, physical reads are only 498 blocks per second at 8 KB per block, i.e. about 4 MB of IO per second. Both the IOPS and the IO bandwidth are very low, so this is obviously not the problem!
[Figure: load profile showing 498 physical reads per second]
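The quick check above is just a unit conversion; sketched out:

```python
# Physical-read bandwidth implied by the AWR load profile.
def read_mb_per_s(blocks_per_s: float, block_kb: int = 8) -> float:
    return blocks_per_s * block_kb / 1024.0

# 498 blocks/s at 8 KB each is under 4 MB/s, nowhere near HBA saturation,
# which rules out the "direct path read fills the bandwidth" hypothesis.
print(round(read_mb_per_s(498), 1))  # 3.9
```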
2.3.4
Xiao Y quickly locks down the analysis direction!
In those two minutes, Xiao Y ran through the hypotheses and exclusions above, chaining the symptoms together.
Xiao Y soon locked down the direction: focus the analysis on the log file sync wait!
The reason is simple. From the top 5 waits, it is not hard to see how they relate to one another:
log file sync is a component of the GC and buffer busy waits! (See the figure in 2.3.3.)
If the log file sync wait is resolved, the gc* waits and the buffer busy wait will naturally come down,
and the problem will be solved!
2.4 Focusing on the "log file sync" wait
From the above, we already know what the "log file sync" wait event means:
when a process issues a commit, it waits for the LGWR background process to persist the change vectors in the log buffer to the redo logs on disk. The most common causes are therefore LGWR writing logs slowly, or commits being issued too frequently!
Xiao Y examined these two aspects in turn.
In Oracle, if LGWR writes logs slowly, the single response time of log file parallel write will be high.
Node 1
[Figure: node 1 log file parallel write and gc log flush sync latency]
Node 2
[Figure: node 2 log file parallel write and gc log flush sync latency]
As you can see, on both nodes log file parallel write and gc log flush sync stay under 5 milliseconds; log file parallel write is only 1 ms and 3 ms respectively. This possibility is eliminated!
Next, check the commit count!
As the figure shows, there are only 48 transactions (commits/rollbacks) per second!
Xiao Y has served the high-concurrency core systems of several large banks, some running more than 10,000 transactions per second with log file sync still kept within 10 milliseconds. A mere 48 transactions per second is a very small number, nowhere near enough to cause such severe waits!
[Figure: load profile showing about 48 transactions per second]
2.5 The root cause is basically located; the first adjustment begins
At this point, Xiao Y had found the root cause of the stress-test problem; all that remained was to adjust and verify.
Friends reading along may want to pause here and see whether you have spotted the cause yourselves.
So, two minutes after the customer sent the AWR report, Xiao Y told him:
"I know the reason. Send me the trace of the LGWR process; I'll make a final confirmation and then we start adjusting."
The customer was stunned: he had not even had time to phone, and Xiao Y already knew... In fact it is no surprise; in the many 11g upgrade projects of the past few years, Xiao Y has hit this case N times!
Checking for it should be a standard pre-launch item. The symptoms vary, but it is essentially the same problem.
It is also the key database parameter that this article's title points to.
If log file sync waits are long while LGWR writes logs quickly and the commit count is small, then the problem lies in the communication between the committing processes and LGWR.
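That elimination logic can be written down as a rule of thumb. The thresholds below (10 ms for a healthy log write, 5,000 commits/s for a commit storm) are illustrative assumptions, not Oracle-documented limits:

```python
# Rule-of-thumb triage for long "log file sync" waits.
def triage_log_file_sync(avg_parallel_write_ms: float,
                         commits_per_s: float) -> str:
    if avg_parallel_write_ms > 10:   # LGWR's own IO is slow
        return "slow LGWR IO: check the IO path"
    if commits_per_s > 5000:         # application commits too often
        return "commit storm: reduce commit frequency"
    # Writes are fast and commits are few, so the time is lost between
    # the committing process and LGWR: suspect the notification mechanism.
    return "committer/LGWR communication: check _use_adaptive_log_file_sync"

# This case: 1-3 ms log writes and only 48 commits/s.
print(triage_log_file_sync(3, 48))
```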
Key points of knowledge:
Starting with 11g, Oracle introduced a polling mechanism for LGWR log writes; previously only the post/wait mechanism existed.
At the same time an implicit parameter, "_use_adaptive_log_file_sync", was introduced to switch adaptively between the two mechanisms. Before 11.2.0.3 its default value is false, meaning only the post/wait mechanism is enabled.
Starting with 11.2.0.3 the default is true, meaning Oracle adaptively switches between the post/wait and polling mechanisms.
Under post/wait, LGWR notifies the committing process as soon as it has written the change vectors from the log buffer, so the log file sync wait is short; the trade-off is a heavier burden on LGWR. After all, before 12c there is only one LGWR process, and when many processes commit concurrently, notifying each of them is itself a load.
Under polling, the committing process notifies LGWR of the write request and then goes to sleep, checking after a timeout whether the log buffer contents have been written to disk; LGWR no longer notifies each committer individually when the write completes. Polling relieves part of LGWR's work, but leaves committing processes stuck in log file sync far longer. For a transactional system this mechanism is highly unsuitable!
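The latency impact can be illustrated with a toy model (a deliberate simplification, not Oracle's actual implementation): under post/wait the committer wakes as soon as the write completes, while under polling the observed commit latency rounds up to the committer's next polling tick.

```python
import math

# Toy model of observed commit latency (ms) under the two mechanisms.
def post_wait_latency(write_ms: float) -> float:
    # LGWR posts the committer the moment the log write finishes.
    return write_ms

def polling_latency(write_ms: float, poll_interval_ms: float) -> float:
    # The committer only notices completion at its next polling tick.
    return math.ceil(write_ms / poll_interval_ms) * poll_interval_ms

# A 1 ms log write with a 10 ms polling interval: log file sync
# balloons to 10 ms even though LGWR itself is fast.
print(post_wait_latency(1.0))      # 1.0
print(polling_latency(1.0, 10.0))  # 10.0
```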
When Oracle switches between post/wait and polling, it records the switch in the LGWR process's trace file, as shown below.
Once it switches to polling mode, log file sync waits easily grow and transaction response time suffers!
Log file sync switching to polling
......
Log file sync switching to post/wait
So Xiao Y reminds everyone here: under Oracle 11.2.0.3 it is recommended to turn off adaptive log file sync, making sure the LGWR process always runs under the post/wait mechanism, so that database performance shows no big jitter! The parameter can be changed online with the following command:
alter system set "_use_adaptive_log_file_sync"=false sid='*';
Yes, Xiao Y's first adjustment was to set this parameter to false.
2.6 The first adjustment's result disappoints!
After changing the parameter online, the customer restarted both node databases for safety
and re-ran the stress test. The newly collected AWR report is as follows!
[Figure: node 1 top wait events after the first adjustment]
In node 1's AWR report, the GC waits and the log file sync wait are still there, and the single waits even look longer!
Was Xiao Y's analysis wrong? Or had Xiao Y run into several intertwined problems this time? Calm down. With RAC problems, remember never to look at only a single node, so Xiao Y asked the customer for node 2's report. Node 2's post-adjustment AWR report is shown below:
[Figure: node 2 top wait events after the first adjustment]
As you can see:
although waits remain, node 2's log file sync wait is gone! The adjustment did have an effect!
Attentive friends may also have noticed that node 1's top wait, gc buffer busy acquire, has disappeared completely (note that node 2's log file sync is now fast), turning into gc buffer busy release. Doesn't that also show the adjustment worked?
But don't relax yet: because log file sync still persists on node 1, node 2's gc buffer busy acquire waits remain. So next, Xiao Y concentrates on resolving node 1's log file sync!
2.7 The truth surfaces (doubt everything)
To summarize: after turning off log file sync adaptivity, the problem remained. That brings us back to the traditional suspect, which is still LGWR writing logs slowly! Although the log file parallel write metric in the AWR report is only a few milliseconds, an AWR report is, after all, just a tool that provides reference values, so we must keep a skeptical attitude and check again!
This time, let's watch the LGWR processes write logs in real time. Issuing the SQL statement gives the result shown:
[Figure: session wait output showing LGWR in log file parallel write with seconds_in_wait = 21]
As you can see:
of the two RAC nodes, only node 1 shows log file parallel write waits, which matches all the previous analysis!
With state = WAITING, the log file parallel write wait with seq# 35693 shows seconds_in_wait reaching 21 seconds. Put simply, LGWR has spent 21 seconds on a single IO!
At this point, we can be sure the IO subsystem has a problem, and the focus must turn to the IO path: fiber-optic cables, SAN switches, and storage errors and performance.
2.8 Further proving the IO-path problem (cross-department cooperation)
Considering that the customer's storage team/department might not accept the database-side evidence of slow IO, and to push them to investigate more thoroughly, Xiao Y had the customer issue the following command to inspect IO at the multipath software layer, with the result shown:
[Figure: multipath output showing IO errors on node 1]
There are obvious IO errors on node 1, and they keep increasing!
Checking node 2 shows no IO errors on node 2 at all!
Recalling the earlier analysis: node 2 had adaptive log file sync set to false, and with no IO errors occurring there, it shows no log file sync waits.
The analysis concludes here! Every question now has a perfect explanation!
With the cause found and strong, convincing evidence in hand, the customer finally relaxed, no longer afraid that the storage team would refuse to acknowledge the problem!
2.9 The problem is satisfactorily resolved
Faced with iron-clad evidence, the customer's storage team did not argue; they began a serious item-by-item investigation, and after the fiber-optic cable was replaced the problem was satisfactorily resolved. Below are the wait events re-measured after the cable replacement!
[Figure: top wait events after replacing the fiber-optic cable]
The measured TPS curve went from the original wave shape:
[Figure: TPS curve before the fix, wave-shaped]
to a healthy curve like the following:
[Figure: TPS curve after the fix, smooth and stable]
Part 3
Problem causes, summary, and risk tips
3.1 Summary of the causes
The airline customer's pre-launch stress test could not reach its concurrency and response-time targets because two intertwined problems were hit at once:
1) On Oracle 11.2.0.3, adaptive log file sync is enabled by default. When it switches to polling mode, log file sync waits grow much longer, and since log file sync is a component of the GC and buffer busy waits, large numbers of waits resulted.
Xiao Y fixed part of the log file sync wait problem by setting "_use_adaptive_log_file_sync" to false.
2) A quality defect in node 1's fiber-optic cable caused IO errors and IO retransmissions, degrading LGWR's log-write performance.
After changing the database parameter from its default and replacing the fiber-optic cable, the problem was satisfactorily resolved.
The measured TPS curve went from the original wave shape:
[Figure: TPS curve before the fix, wave-shaped]
to a healthy curve like the following:
[Figure: TPS curve after the fix, smooth and stable]
3.2 Review of the key points in solving the problem
1) Analyze Oracle wait events without fragmentation.
In this case, Xiao Y found the breakthrough by noticing what the wait events had in common: log file sync.
2) Understand the characteristics and behavior of different database versions.
Xiao Y keeps studying the new features of 11g and deepens that understanding through extensive fault handling, so when log file sync appeared, the new-feature cause could be located quickly.
3) Do not fully trust the AWR report; it is only a tool. Doubt everything and verify.
In this case, the AWR report metrics did not truly reflect LGWR's write performance.
4) An Oracle service engineer who knows only the database ends up suspecting this and that while other teams refuse to accept it, so you must also master operating system, storage, network, and middleware skills. Of course, finding a service provider with comprehensive capabilities is also a good choice.
In this case, Xiao Y found direct evidence through the multipath command, and the other team's thorough investigation that followed was key to the final resolution.
Risk Alert
Starting with 11g, Oracle introduced a polling mechanism for LGWR log writes alongside the original post/wait mechanism, and starting with 11.2.0.3 the implicit parameter "_use_adaptive_log_file_sync" defaults to true, letting Oracle switch adaptively between the two (see the key knowledge point in section 2.5). When the database switches to polling mode, log file sync waits rise sharply and transaction response time suffers; the switches are recorded in the LGWR trace file as "Log file sync switching to polling" and "Log file sync switching to post/wait".
So Xiao Y reminds you here: under Oracle 11.2.0.3, it is recommended to turn off adaptive log file sync and make sure the LGWR process runs under the post/wait mechanism, so that the database shows no serious performance jitter! The change can be made online:
alter system set "_use_adaptive_log_file_sync"=false sid='*';