This is a customer case from around February/March. After the server hardware was replaced, business transactions occasionally experienced slow commits. Let's take a look at the AWR report.
From the load profile we can see that the system load is actually not high: only about 21 transactions per second. Let's take a look at the Top 5 Timed Events:
From the Top 5 events we can see that the average wait time of log file sync is very high, as much as 124 ms. You should know that in the vast majority of systems the average log file sync wait is below 5 ms, so this is far too high.
We know that there are many possible causes of the log file sync wait. Tanel Poder has written an excellent PDF on this topic, which is well worth reading. Here I will mainly borrow his diagram to briefly describe the possible causes of log file sync. First, let's look at the entire process from the moment the foreground process issues the commit to the moment it receives the final acknowledgement:
From this we can clearly see the entire flow, which can be summarized briefly:
1. The user session issues a commit.
2. The foreground process (that is, the server process) posts a message to the lgwr process, telling it to write out the redo buffer.
3. Once posted, lgwr calls the operating system functions to perform the physical write; during this time lgwr waits on log file parallel write. Some people may wonder why the event is called a "parallel" write when (before 12c) there is only a single lgwr process. The point to note is that lgwr writes the data in the redo buffer to the online log files in batch mode (the dbwn processes also write in batches), with the batch size controlled by related hidden parameters.
4. After lgwr completes the write, it posts a message back to the foreground (server) process, telling it: "I have finished writing; you can complete the commit."
5. The user's commit operation completes.
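The round trip in steps 1-5 can be sketched as a toy timing model. This is purely illustrative: the component names and numbers below are my own assumptions, not Oracle internals, but it shows why log file sync (the whole round trip seen by the foreground) can be much longer than log file parallel write (only the physical-I/O slice inside it):

```python
# Toy model of one commit round trip (steps 1-5 above). All component
# names and numbers are illustrative assumptions, not Oracle internals.
def log_file_sync_ms(post_lgwr_ms, lgwr_cpu_queue_ms,
                     log_file_parallel_write_ms, post_foreground_ms):
    # 'log file sync' is what the foreground session experiences: the
    # whole round trip. 'log file parallel write' is only the physical
    # I/O portion, so sync can never be shorter than the write itself.
    return (post_lgwr_ms + lgwr_cpu_queue_ms +
            log_file_parallel_write_ms + post_foreground_ms)

# Healthy system: the 4 ms physical write dominates the round trip.
healthy = log_file_sync_ms(0.1, 0.1, 4.0, 0.1)
# CPU-starved system: the same 4 ms write, but lgwr queues 60 ms for CPU.
starved = log_file_sync_ms(0.1, 60.0, 4.0, 0.1)
print(healthy, starved)
```

Note that in both scenarios log file parallel write stays at 4 ms; only the foreground-visible log file sync balloons, which is exactly the asymmetry we will see later in this case.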
One point worth adding here: this is because of Oracle's log-write-ahead principle. If the redo entries in the redo buffer were not written to the online redo log before the commit returned, data would be lost if the database crashed.
From the preceding flowchart we can see that log file sync and log file parallel write are closely related: if log file parallel write takes a long time, the log file sync wait will be lengthened accordingly.
If the log file parallel write wait is very high, it may be a physical disk I/O problem, for example:
As we can see, if the lgwr process takes too long to complete its I/O operations, the log file parallel write wait rises.
In fact, the whole path from issuing a commit to its completion involves many links, and physical I/O is not the only thing that affects log file sync and log file parallel write: CPU scheduling affects them as well. Let's look at the figure:
We can see that four links in the above process involve CPU scheduling. If the system CPU is extremely scarce during the transaction commit, the lgwr process may be unable to get onto the CPU and has to queue for it; obviously this will lengthen the log file sync or log file parallel write wait.
Note: Oracle can also raise a process's CPU scheduling priority via the hidden parameter _high_priority_processes. On a system where CPU is relatively scarce, setting this parameter can mitigate the problem.
Finally, let's come back to this case. We can rule out a CPU problem in the customer's environment, so the biggest suspect would be a storage problem causing very slow I/O. In fact, this can be ruled out too. Note in the Top 5 events that the average log file parallel write wait time is not high; if this were a storage I/O problem, the average wait time of that event should also be elevated.
We can see that the wait counts of log file sync and log file parallel write are almost the same, yet the average log file parallel write wait is only 4 ms, which is a normal value. In other words, we can rule out a storage I/O problem.
So what is the problem? Using the script provided by Oracle MOS, we queried the distribution of the log file sync and log file parallel write waits:
   INST_ID EVENT                                    WAIT_TIME_MILLI WAIT_COUNT
---------- ---------------------------------------- --------------- ----------
         1 log file sync                                          1     259306
         1 log file sync                                          2    2948999
         1 log file sync                                          4    1865918
         1 log file sync                                          8     173699
         1 log file sync                                         16      43194
         1 log file sync                                         32       6095
         1 log file sync                                         64       1717
         1 log file sync                                        128       2458
         1 log file sync                                        256       5180
         1 log file sync                                        512       9140
         1 log file sync                                       1024     558347
         1 log file parallel write                               1       5262
         1 log file parallel write                               2    4502377
         1 log file parallel write                               4    1319211
         1 log file parallel write                               8      46055
         1 log file parallel write                              16      23694
         1 log file parallel write                              32       3149
         1 log file parallel write                              64        283
         1 log file parallel write                             128        267
         1 log file parallel write                             256        157
         1 log file parallel write                             512         73
         1 log file parallel write                            1024         42
         1 log file parallel write                            2048         39
         1 log file parallel write                            4096        103
         1 log file parallel write                            8192         21
         1 log file parallel write                           16384         22
         1 log file parallel write                           32768        190
         1 log file parallel write                           65536          1
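We can quantify this with a quick calculation over the histogram above. The sketch below is only a rough approximation (it treats every wait as its bucket's upper bound, which slightly overstates the averages), but it is enough to show the asymmetry:

```python
# Rough analysis of the v$event_histogram-style output above.
# Approximation: each wait is counted at its bucket's upper bound
# (WAIT_TIME_MILLI), so the computed averages are slightly pessimistic.

sync_buckets = {1: 259306, 2: 2948999, 4: 1865918, 8: 173699,
                16: 43194, 32: 6095, 64: 1717, 128: 2458,
                256: 5180, 512: 9140, 1024: 558347}

write_buckets = {1: 5262, 2: 4502377, 4: 1319211, 8: 46055,
                 16: 23694, 32: 3149, 64: 283, 128: 267,
                 256: 157, 512: 73, 1024: 42, 2048: 39,
                 4096: 103, 8192: 21, 16384: 22, 32768: 190,
                 65536: 1}

def summarize(buckets, fast_ms=4):
    """Return (approx. average wait in ms, fraction of waits <= fast_ms)."""
    total = sum(buckets.values())
    avg = sum(ms * n for ms, n in buckets.items()) / total
    fast = sum(n for ms, n in buckets.items() if ms <= fast_ms) / total
    return avg, fast

sync_avg, sync_fast = summarize(sync_buckets)
write_avg, write_fast = summarize(write_buckets)
print(f"log file sync:           avg ~{sync_avg:.0f} ms, {sync_fast:.1%} <= 4 ms")
print(f"log file parallel write: avg ~{write_avg:.0f} ms, {write_fast:.1%} <= 4 ms")
```

This gives an average of roughly 100 ms for log file sync versus about 4 ms for log file parallel write, which matches the AWR picture: the write itself is fine, but the round trip seen by the foreground is not.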
A simple calculation on these distributions shows that roughly 99% of the log file parallel write waits complete in 4 ms or less, which is normal. The log file sync distribution, however, has a long tail: a sizeable number of waits land in the 1024 ms bucket, that is, waits of up to about 1 second. It is these occasional very long waits that drive up the overall average log file sync time.
At this point the problem is much clearer. My judgment was that the link between the host and the storage was abnormal or unstable. As a temporary workaround, we moved the redo logfiles to local disk, and the problem was resolved.
Note: after confirmation from the customer, the fiber-optic cable connector on the storage was indeed loose. Haha!