This is a problem where RMAN is used in the Oracle database for database operations. The backup performance is slow to unacceptable due to the improper configuration in the default configuration.
The entire problem solving process involves storage service providers, networks, operating systems, and Oracle. The solution was complicated and difficult, and even began to doubt itself. At the end of the peak, the key information was found in the output log of RMAN backup, which solved the problem.
This problem is complicated. If we can take a closer look at the log information, it will save a lot of time and manpower, and we will not go around in this big circle.
1. Environment
The customer's database system runs on the Linux RedHat system. The database system is Oracle 10.2.0.5.6, with three-node RAC.
The database name is WEBDB and runs in archive mode. The current database size is about 900 GB, and the amount of archived logs generated every day is about 50 GB, but it also has 100 GB at the highest peak.
Database Backup uses the RMAN tool to back up data to the disk mounted to the server.
2. Problem
During database backup, only about 4 MB can be written per second.
If you back up file 88, the file size is 5 GB.
RMAN> backup datafile 88 format'/u01/app/oracle/88. bk ';
Starting backup at 14-3-12
Using channel ORA_DISK_1
Channel ORA_DISK_1: starting compressed full datafile backupset
Channel ORA_DISK_1: specifying datafile (s) in backupset
Input datafile fno = 00088 name = + DG/webdb/datafile/qlzq50.dbf
Channel ORA_DISK_1: starting piece 1 at 14-3 month-12
Channel ORA_DISK_1: finished piece 1 at 14-3 month-12
Piece handle =/u01/app/oracle/88.bk tag = TAG20120314T113321 comment = NONE
Channel ORA_DISK_1: backup set complete, elapsed time: 00:04:35
Finished backup at 14-3-12
Starting Control File and SPFILE Autobackup at 14-3-12
Piece handle =/dbbackup/webdb/c-3243882293-20120314-04 comment = NONE
Finished Control File and SPFILE Autobackup at 14-3-12
The backup time is 4 minutes 35 seconds and the backup result set is 1.1 GB.
The backup speed is the same for multiple backup tests.
3. Analysis
In the new environment 10.2.0.4, create a database named WEBDB in advance, which is consistent with the name of the original website database.
The database is migrated and the database version is upgraded to 10.2.0.5. The previous version was 10.2.0.4, And the backup was not slow at that time.
The new and old databases use a type of server, which is a 64-core CPU and 32 GB memory. The storage cabinet connected to the new database is the expansion cabinet, while the old database uses the primary cabinet.
We first analyze the session wait events during the RMAN backup process and find that they are always "RMAN backup & recovery I/O" waiting events.
The check method is "select * from v $ session_event a where a. SID = 1450 ;".
That is to say, at the database level, the system operates on the I/O of RMAN.
The system load is very low and stable throughout the process.
We will analyze the I/O in the RMAN backup process. This is analyzed from two aspects: system and database performance view.
On the one hand, we use vmstat-1, iostat-m 1, and the results show that the system IO is very poor, only 5 MB of write per second, the read is slightly larger, about 30-40 MB per second. However, it cannot completely reach the normally stored IO value.
After testing the storage performance, the storage engineers prove that the I/O capability is MB read per second and MB write per second. However, during rman backup, this performance never exists.
On the other hand, the database, we query the performance system view, the method is "select * from v $ backup_async_io where open_time> sysdate-1/2;", we found that the RMAN backup reads 16 mb per second, the write speed is 4 MB per second, which is stable.
When querying oracle rman principles, it is found that under normal circumstances, the rman io of a channel is as fast as this. In addition, writing this sample does not consume much time. (This statement may be incorrect. Please refer to the official RMAN_BUFFER.dbf document)
To this point, we can't see what the backup process is doing at other times from the database information. The backup read/write speed is normal, but the system cannot complete the backup quickly.
Here we use the rman backup tool, and we think of another BACKUP method, rman copy. This kind of COPY is to simply COPY the files in the ASM disk to the file system without any changes.
After testing, it is found that this method is very fast. When backing up a file to a local hard disk, it takes only 55 seconds for a 5 GB file, and even if it is backed up to the storage, it takes only 2 minutes 45 seconds.
The following is a test table.
RMAN copy mode
Node 3 slave to local 55 seconds
Node 2 slave to local 56 seconds
Node 1 backup to local 55 seconds
Node 1 backup to storage 2 minutes 45 seconds
RMAN backup mode
Node 1 slave to local 4 minutes 35 seconds
Node 1 backup to storage 4 minutes 35 seconds
At the database level, after being powerless, we can find the next layer to directly analyze what process operations are doing at the operating system level.
The STRACE analysis process tool in LINUX is used here.
The complete usage of this tool is:
Strace-o/tmp/output2.txt-T-tt-e trace = all-s 12-p 17129
Parameters that must be remembered:
1) The strace-p pid can track a background process.
2) strace-o filename outputs the trace result to the file
3) strace-T records the time spent on each system call. You can see which system call takes a long time.
4) strace-t (or-tt) records the time when each system call occurred (in the format of hour, minute, and second)
5) When strace-s 1024 displays system call parameters, the default length of the string is 32. If the string parameters are long, a lot of information cannot be displayed.
6) strace-e trace = nanosleep only records relevant system call Information. -E trace = network // only records system calls related to network APIs
-E trace = file // only records system calls involving file names
-E trace = desc // only records system calls involving file handles
Others include process, ipc, and signal.
We analyzed the strace results of the COPY and BACKUP methods, and found that the pread and pwrite operation time was not different. On the contrary, PWRITE is less frequently used during BACKUP, but there is not much time difference between each operation. The final result is a 55-second, 4 minutes 35 seconds, with a difference of five to six times.
Test method not used:
1. Create a new database in the WEBRAC environment and test whether it is caused by database configuration to determine whether it is an OS-level configuration or a DB-level configuration problem. However, it is a production environment.
2. Adjust the implicit parameters involved in rman buffer.
For example, the written parameter _ db_file_direct_io_count. However, because it is a production environment, it is highly risky.
3. Set up a test environment and analyze whether the problem is 10.2.0.5. This is the first project to adopt 10.2.0.5. However, if RYDRAC is 10.2.0.5 in the same version, the test results are normal and there is no problem at all. Although RYDRAC mounts a better storage performance than this one.
The time it takes for rman backup to remove normal read and write operations, and what operations are performed between read and write operations? Most of the time is consumed between read and write operations. For example, what operations are performed to read data blocks from the ASM disk to the memory and then write them to the file system disk.
At one time, I thought there were too many CPUs. Will it be a problem with hyper-threading?
I thought this storage was asynchronous I/O?