The life cycle of an application system should be viewed as a whole. Beyond the initial requirements research, development, testing, and launch, the longest phase is operations and maintenance (O&M). The value of an application system is realized during O&M, and a system that frequently reports errors and faults in production can hardly deliver a good user experience.
In practice, without sound communication between software developers and O&M personnel, a new system is not easily integrated into the existing O&M framework, and many other faults may follow. This article describes a disk space failure caused by a backup policy conflict.
1. Environment and Fault Description
I recently took over a system that had been online for more than a year. During the handover, the business department reported that the disk had once filled up. The whole system was paralyzed at the time, and the problem was eventually resolved by contacting the developer. According to the feedback, however, the problem was never completely solved, and the developers could only deal with it periodically.
Because information was limited, I could only observe and analyze the system on site. The server runs Red Hat Enterprise Linux 6.1 and the database version is Oracle 11.2.0.3.
[root@DB ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.1 (Santiago)
SQL> select * from v$version;
BANNER
---------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE 11.2.0.3.0 Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production
The fault is related to disk space, so the first step is to check the current disk usage with df.
[root@DB ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        59G  8.4G   48G  15% /
tmpfs           3.9G  288K  3.9G   1% /dev/shm
/dev/sda2       194M   41M  143M  23% /boot
/dev/sda1       200M  256K  200M   1% /boot/efi
/dev/sda8       1.4T  351G  976G  27% /data
/dev/sda4        59G   23G   34G  40% /home
/dev/sda5        59G  180M   56G   1% /tmp
/dev/sda6        59G  5.9G   50G  11% /var
The space layout is typical, and resources are relatively plentiful. The largest partition, /data, has a capacity of about 1.4 TB, of which 351 GB is in use. According to the oracle user's environment variables, the database software is installed under /home, and the data files are under /data.
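A check like the df output above can be scripted so that a filling partition is noticed before it reaches 100%. The following is a minimal sketch and not part of the original system: the function name mounts_over_threshold and the 80% default are assumptions chosen for illustration. It reads df -P style output on standard input and prints every mount point at or above the threshold.

```shell
# Hypothetical helper: print mount points whose usage is at or above a
# threshold percentage. Reads `df -P` output on standard input.
mounts_over_threshold() {
    threshold="${1:-80}"    # assumed default, adjust as needed
    awk -v t="$threshold" '
        NR > 1 {                      # skip the header line
            use = $5
            sub(/%/, "", use)         # "27%" -> "27"
            if (use + 0 >= t + 0) print $6, $5
        }'
}

# Example: df -P | mounts_over_threshold 80
```

The sketch assumes mount points without spaces, which holds for the partitions listed above.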
[oracle@DB]/home/oracle> env | grep ORA
ORACLE_BASE=/home/oracle/app
ORACLE_HOME=/home/oracle/app/product/11.2.0/db_1
ORACLE_OWNER=oracle
ORACLE_SID=db
The schema data volume of the business system is very small, only 77 MB. According to the business analysis, the system's business data is stored only in the database, and there is no deletion mechanism. In this case, the probability that the disk filled up because of a sudden expansion of business data is very low.
The analysis therefore focuses on one question: what is consuming the 351 GB that df reports as used under /data?
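One common way to answer this kind of question is to rank the subdirectories of the partition by size. The sketch below is only an illustration: the helper name largest_subdirs and its default of ten entries are assumptions, not something taken from the system described here.

```shell
# Hypothetical helper: list the N largest immediate subdirectories of a
# directory, with sizes in KB, largest first.
largest_subdirs() {
    dir="$1"
    count="${2:-10}"
    # du -sk prints one summary line per argument; sort numerically, descending
    du -sk "$dir"/*/ 2>/dev/null | sort -rn | head -n "$count"
}

# Example: largest_subdirs /data 5
```

Repeating the call on the largest entry narrows the search down one directory level at a time.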
2. Problem Analysis
Looking into the /data directory shows that the application stores its RMAN backups there.
[root@DB rman]# pwd
/data/db/rman
[root@DB rman]# ls -l
total 1312
drwxr-xr-x. 2 oracle oinstall 409600 Mar 7 bak
-rw-r --. 1 oracle oinstall 0 Aug 21 2013 get
drwxr-xr-x. 2 oracle oinstall 921600 Mar 7 logs
-rwxr-x---. 1 oracle oinstall 1037 Jul 1 2013 rman_full.sh
Clearly, the /data/db/rman directory belongs to the application system's built-in backup mechanism. Many systems today ship with their own database backup module, and from what we see here, this one uses RMAN for its backups.
The rman_full.sh script in this directory is the main script that performs the backup.
[root@DB rman]# cat rman_full.sh
#!/bin/ksh
# Set env
(omitted for brevity ......)
$BIN/rman log $BACKUP_LOG/$TARGET_SID.full.$DATE_3.log <<EOF
connect target /
run {
allocate channel c1 type disk;
allocate channel c2 type disk;
backup full database format '$BACKUP_PATH/${DATE_2}_full_%d_%s_%p_%u.bak'
tag='full' include current controlfile;
sql 'alter system archive log current';
backup archivelog all format '$BACKUP_PATH/${DATE_2}_archivelog_%d_%s_%p_%u.bak';
delete noprompt expired backupset of archivelog all;
release channel c1;
release channel c2;
}
crosscheck backup;
delete noprompt expired backup;
delete noprompt obsolete;
exit;
EOF
Viewed objectively, this script has no obvious problems. It sets the environment variables and directory locations, then backs up the database and the archived logs. It then runs crosscheck to identify expired backups and deletes obsolete backups according to the retention policy.
In the directory structure, bak stores the backup sets (although the control file backups are left in $ORACLE_HOME/dbs), and logs holds the text logs. Enter the bak directory and check the backups.
[root@DB bak]# ls | more
20130719_archivelog_db_rj189_1_k5of3j4s.bak
20130719_archivelog_db_1_1__1_k6of3j4t.bak
20130719_full_db_0000180_0000jsof3j1b.bak
20130719_full_db_rj186_1_k2of3j4d.bak
20130720_archivelog_db_2017258_1_maof64d1.bak
20130720_archivelog_db_2017259_1_mbof64d2.bak
20130720_full_pdb_255.255_1_m7of64cn.bak
(omitted for brevity)
20140307_full_db_1151__127d3p2ho2g.bak
20140307_full_db_1151__1_d4p2ho2g.bak
20140307_full_db_1151__1_d5p2ho47.bak
201401171422.dmp
full_20130720.tar.gz
rm
Note that each backup piece name contains its date. Backup sets have been kept as far back as July 2013, and the total data volume is about 300 GB.
[root@DB bak]# du -h
301G    .
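Because every piece name starts with a date, the age distribution of the backups can be summarized directly from the listing. The following is a small sketch under that assumption. The helper name count_backups_by_month is hypothetical, and it only looks at names of the form YYYYMMDD_..., ending in .bak.

```shell
# Hypothetical helper: count backup pieces per month, based on the YYYYMMDD
# prefix of names such as 20130719_full_..._.bak. Reads file names on stdin.
count_backups_by_month() {
    # keep the YYYYMM part of matching names, drop everything else
    sed -n 's/^\([0-9]\{6\}\)[0-9]\{2\}_.*\.bak$/\1/p' |
        sort | uniq -c |
        awk '{ print $2, $1 }'    # print "YYYYMM count"
}

# Example: ls /data/db/rman/bak | count_backups_by_month
```

Entries that do not follow the naming pattern, such as the .dmp and .tar.gz files in the listing, are ignored by this sketch.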
This is clearly a problem. The RMAN backup script contains an explicit delete obsolete statement to remove backup sets that are no longer needed. The retention rule that defines what counts as obsolete can be confirmed with show all.
RMAN> show all;
RMAN configuration parameters for database with db_unique_name DB are:
CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS;
CONFIGURE BACKUP OPTIMIZATION OFF; # default
CONFIGURE DEFAULT DEVICE TYPE TO DISK; # default
CONFIGURE CONTROLFILE AUTOBACKUP ON;
The retention policy is a recovery window of 7 days. In reality, the contents of the bak directory go far beyond this window. With backups retained indefinitely, the space used by bak keeps growing, so even a /data partition of 1.4 TB will eventually fill up.
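A rough way to see how much of the directory falls outside the window is to list the files by modification time. This is only an approximation, offered as a sketch: a recovery window is not the same as file age, because RMAN still needs the most recent backup taken before the start of the window. The helper name list_stale_backups is hypothetical. It lists files and deliberately deletes nothing, since removing pieces at the operating system level would leave RMAN's repository records pointing at missing files.

```shell
# Hypothetical helper: list *.bak files last modified more than DAYS days ago.
# It only lists files; it does not delete anything.
list_stale_backups() {
    dir="$1"
    days="${2:-7}"      # 7 matches the recovery window reported by show all
    # -mtime +N matches files whose age in whole days is greater than N
    find "$dir" -type f -name '*.bak' -mtime +"$days" | sort
}

# Example: list_stale_backups /data/db/rman/bak 7 | wc -l
```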
Having confirmed why the disk filled up, we need to find out why the old backups were not deleted when the application's RMAN script ran. The answer is in the daily logs in the logs directory.
[root@DB logs]# tail -n 20 db.full.20140307010102.log
Backup Piece 73678 27-FEB-14 c-1778314713-20140227-02
Backup Set 73679 feb-14
Backup Piece 73679 27-FEB-14 a5p1mv7i_1_1
Backup Set 73680 feb-14
Backup Piece 73680 27-FEB-14 a6p1mv8m_1_1
Backup Set 73681 feb-14
Backup Piece 73681 27-FEB-14 c-1778314713-20140227-03
Backup Set 73684 28-FEB-14
Backup Piece 73684 28-FEB-14 /data/awpdb/rman/bak/20140228_full_PDB_115018_1_aap1ncc6.bak
Backup Set 73685 28-FEB-14
Backup Piece 73685 28-FEB-14 /home/oracle/app/product/11.2.0/db_1/dbs/c-1778314713-20140228-00
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of delete command at 03/07/2014 01:02:57
RMAN-06091: no channel allocated for maintenance (of an appropriate type)
The delete obsolete command failed: RMAN-06091 indicates that no channel of an appropriate type could be allocated for the maintenance operation. This is the root cause. Because delete obsolete has been failing for a long time, the script has never managed to remove outdated backups, and the backup files gradually filled the file system.
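Errors like this are easy to miss when nobody reads the nightly logs. The following sketch shows one way to scan them. Both helper names, rman_error_codes and logs_with_code, are hypothetical, and the sketch assumes the logs are plain text files ending in .log, as in the logs directory above.

```shell
# Hypothetical helper: print the distinct RMAN error codes found in a log,
# ignoring the RMAN-00569/RMAN-00571 banner lines of the error stack.
rman_error_codes() {
    grep -o 'RMAN-[0-9][0-9]*' "$1" |
        grep -v -e '^RMAN-00569$' -e '^RMAN-00571$' |
        sort -u
}

# Hypothetical helper: list the log files in a directory that contain a code.
logs_with_code() {
    grep -l "$2" "$1"/*.log 2>/dev/null | sort
}

# Example:
#   rman_error_codes /data/db/rman/logs/db.full.20140307010102.log
#   logs_with_code /data/db/rman/logs RMAN-06091 | head -n 1
```

Because the log file names embed a timestamp, the first line printed by the second example would be the earliest log in which the error appears, assuming the names sort chronologically.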
So why does this error occur?