An old rookie's struggle with Oracle ASM

Tags dell r610


Author: Tian Yi (sery@163.com) from http:// B .formyz.org/2011/0726/46.html

Application environment

I. Hardware
1. Servers: 2 Dell R610, each with 16 GB memory, two 6-core Xeon CPUs, and two 146 GB SAS disks in RAID 1
2. Storage: Dell MD3220 with 24 5.52 GB hard drives
3. Storage connection: 6 Gb card, connected over two channels

II. Software
1. System: 64-bit CentOS
2. Kernel version: Linux rac1 2.6.18-194.el5 #1 SMP Fri Apr 2 14:58:14 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
3. ASM software: oracleasm-2.6.18-194.el5-2.0.5-1.el5.x86_64.rpm, oracleasm-support-2.1.7-1.el5.x86_64.rpm, oracleasmlib-2.0.4-1.el5.x86_64.rpm
4. Database software: linux.x64_11gR2_database_1of2.zip, linux.x64_11gR2_database_2of2.zip
5. Cluster software: linux.x64_11gR2_grid.zip

Fault description

I. State before the fault
1. The cluster instances run normally.
2. asmcmd can be used to view ASM directories and files.
3. The database instances are normal.
4. The listener is normal.
5. Clients can connect remotely.
6. Multipath access is normal.
7. All files in the /dev/oracleasm/disks directory exist.

II. Cause of the fault
1. I intended to simulate a server failure.
2. I restarted both servers directly with init 6.

III. Fault symptoms
1. The ASM instances on both servers do not start.
2. The Oracle instances on both servers do not start.
3. Processes such as CRS are started, but they basically do not work.
4. Starting CRS manually fails.
5. Connecting to the instance manually as the grid user and forcing a start also fails.

IV. Preliminary diagnosis
The database data files and the OCR files required by the cluster software are stored on shared storage managed by ASM. The cluster software (including the ASM instance) failed to start, which ultimately caused the database instances to fail to start.

Process

I. Locating the fault

1. Checking the system processes shows that no ASM process is running; only a few grid-related processes are present.

2. Running /u01/app/grid/bin/crsctl start crs manually fails.

3. Checking the device file directory /dev/mapper shows that all the partitions on the shared storage exist.

[root@rac2 ~]# ll /dev/mapper/
total 0
crw------- 1 root root  10, 63 Jul 24 control
brw-rw---- 1 root disk 253,  0 Jul 24 mpath13
brw-rw---- 1 root disk 253, 10 Jul 24 mpath13p1
brw-rw---- 1 root disk 253, 11 Jul 24 mpath13p2
brw-rw---- 1 root disk 253, 12 Jul 24 mpath13p3
brw-rw---- 1 root disk 253, 13 Jul 24 mpath13p5
brw-rw---- 1 root disk 253, 14 Jul 24 mpath13p6
brw-rw---- 1 root disk 253, 15 Jul 24 mpath13p7
brw-rw---- 1 root disk 253, 16 Jul 24 mpath13p8
brw-rw---- 1 root disk 253,  1 Jul 24 mpath14
brw-rw---- 1 root disk 253,  3 Jul 24 mpath14p1
brw-rw---- 1 root disk 253,  4 Jul 24 mpath14p2
brw-rw---- 1 root disk 253,  5 Jul 24 mpath14p3
brw-rw---- 1 root disk 253,  6 Jul 24 mpath14p5
brw-rw---- 1 root disk 253,  7 Jul 24 mpath14p6
brw-rw---- 1 root disk 253,  8 Jul 24 mpath14p7
brw-rw---- 1 root disk 253,  9 Jul 24 mpath14p8
brw-rw---- 1 root disk 253,  2 Jul 24 mpath15

4. My first suspicion was that the ASM disks were faulty. I ran oracleasm listdisks and got only one line of output, where there should have been more than 10.

[root@rac2 ~]# oracleasm listdisks
DATA06
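The comparison made in step 4 (labels that should exist versus labels that oracleasm listdisks actually reports) can be sketched as a small shell function. This is only an illustration: the function name missing_labels is hypothetical, and the list of expected labels has to come from your own installation records.

```shell
#!/bin/sh
# missing_labels EXPECTED ACTUAL
# EXPECTED and ACTUAL are whitespace-separated lists of ASM disk labels.
# Prints, one per line, every label in EXPECTED that is absent from ACTUAL.
# (Hypothetical helper, not part of the oracleasm tooling.)
missing_labels() {
    expected="$1"
    actual="$2"
    for label in $expected; do
        found=no
        for have in $actual; do
            if [ "$label" = "$have" ]; then
                found=yes
            fi
        done
        if [ "$found" = no ]; then
            echo "$label"
        fi
    done
    return 0
}

# Example use on a real host (not executed here):
#   missing_labels "OCR1 OCR2 DATA06 DATA08" "$(oracleasm listdisks)"
```

On the failed host described here, where listdisks prints only DATA06, every other expected label would be reported as missing.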

5. I ran oracleasm scandisks and checked again: still only one disk. According to what I found online, other people have hit the same thing and recovered by running the ASM disk scan several times, but that did not work for me.

6. When the ASM disks were first initialized, running oracleasm createdisk OCR1 /dev/mapper/mpath14p1 generated the file OCR1 in the /dev/oracleasm/disks directory. The file name is the name of the ASM disk, so every ASM disk that is created gets a file of the same name. In /dev/oracleasm/disks there was now only one block device file, DATA06; all the others were missing.

[root@rac2 ~]# ll /dev/oracleasm/disks/
total 0
brw-rw---- 1 grid asmadmin 8, 22 Jul 24 00:01 DATA06

The result matches the output of oracleasm listdisks exactly. This shows that the ASM disks are tied directly to an operating system directory: each ASM disk name corresponds to a block device file of the same name under /dev/oracleasm/disks, which in turn refers to a partition on the shared storage.

The task now was to restore the remaining disk files.

II. Solution

1. I searched online first and found plenty of material. Most of it recommends wiping the disks with dd and then recreating them with oracleasm createdisk. I was worried that this would destroy all the data on the ASM disks.

2. I called a former colleague who has since become a DBA. He told me to check whether the permissions on the device files were correct. On the servers I found that /dev/mapper/mpath* was owned by root:disk on one machine and by grid:oinstall on the other. Following his suggestion, I added a "chown -R grid:oinstall" line to /etc/rc.local and restarted the system. The owner had changed as I wanted, but the ASM disks were still not recognized, so the problem was evidently elsewhere.

3. I posted in several QQ groups. Some people suspected a permission problem and recommended oracleasm scandisks; some thought the oracleasm service was not running; others said the permissions on the raw devices were wrong. I was not using raw devices: /etc/sysconfig/rawdevices contains no effective lines (everything is commented out), which rules raw devices out.

Several days passed with no progress. One day, lying in bed, the problem came back to me: since /dev/oracleasm/disks contains the file DATA06, could I create the lost files by hand? I got up again. But when I ran touch /dev/oracleasm/disks/DATA08, the system reported that I had no permission. It seems that only commands such as mknod can create files in that directory. Then I thought: since the DATA06 file exists, it is probably recorded in some file. A recursive search with grep -r DATA06 /etc turned up the file /etc/blkid.tab, whose content is as follows:

[root@rac1 ~]# more /etc/blkid.tab
<device DEVNO="0x0807" TIME="1311517843" LABEL="/u011" UUID="0aa29b92-8132-4749-b016-5425e1a7bcc6" TYPE="ext3" SEC_TYPE="ext2">/dev/sda7</device>
<device DEVNO="0x0806" TIME="1311517843" LABEL="/tmp1" UUID="e6370305-9868-4190-a256-b9bc84c07b75" TYPE="ext3" SEC_TYPE="ext2">/dev/sda6</device>
<device DEVNO="0x0805" TIME="1311517843" LABEL="/usr1" UUID="d54e800b-b1c8-4d0f-9095-955122dc5ff1" TYPE="ext3" SEC_TYPE="ext2">/dev/sda5</device>
<device DEVNO="0x0803" TIME="1311517843" LABEL="/var1" UUID="ee60d33f-0c2b-4e80-93ff-f442c99f6dcf" TYPE="ext3" SEC_TYPE="ext2">/dev/sda3</device>
<device DEVNO="0x0802" TIME="1311517849" TYPE="swap" LABEL="SWAP-sda2">/dev/sda2</device>
<device DEVNO="0x0801" TIME="1311517843" LABEL="/1" UUID="08e3c9b3-3cfe-403c-9cd8-ff9e007602fd" TYPE="ext3" SEC_TYPE="ext2">/dev/sda1</device>
<device DEVNO="0x0000" TIME="1311517849" TYPE="swap">/u01/swapfile</device>
<device DEVNO="0xfd10" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/mapper/mpath14p8</device>
<device DEVNO="0xfd09" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/mapper/mpath15p8</device>
<device DEVNO="0xfd0e" TIME="1311441535" LABEL="DATA06" TYPE="oracleasm">/dev/mapper/mpath14p6</device>
<device DEVNO="0xfd0d" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/mapper/mpath14p5</device>
<device DEVNO="0xfd06" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/mapper/mpath15p5</device>
<device DEVNO="0xfd0c" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/mapper/mpath14p3</device>
<device DEVNO="0xfd05" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/mapper/mpath15p3</device>
<device DEVNO="0xfd0b" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/mapper/mpath14p2</device>
<device DEVNO="0xfd04" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/mapper/mpath15p2</device>
<device DEVNO="0xfd0a" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/mapper/mpath14p1</device>
<device DEVNO="0xfd03" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/mapper/mpath15p1</device>
<device DEVNO="0x0811" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/sdb1</device>
<device DEVNO="0x0812" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/sdb2</device>
<device DEVNO="0x0813" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/sdb3</device>
<device DEVNO="0x0815" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/sdb5</device>
<device DEVNO="0x0816" TIME="1311441535" LABEL="DATA06" TYPE="oracleasm">/dev/sdb6</device>
<device DEVNO="0x0818" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/sdb8</device>
<device DEVNO="0x0821" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/sdc1</device>
<device DEVNO="0x0822" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/sdc2</device>
<device DEVNO="0x0823" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/sdc3</device>
<device DEVNO="0x0825" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/sdc5</device>
<device DEVNO="0x0828" TIME="1311441535" LABEL="" TYPE="oracleasm">/dev/sdc8</device>
<device DEVNO="0x0841" TIME="1311441536" LABEL="" TYPE="oracleasm">/dev/sde1</device>
<device DEVNO="0x0842" TIME="1311441536" LABEL="" TYPE="oracleasm">/dev/sde2</device>
<device DEVNO="0x0843" TIME="1311441536" LABEL="" TYPE="oracleasm">/dev/sde3</device>
<device DEVNO="0x0845" TIME="1311441536" LABEL="" TYPE="oracleasm">/dev/sde5</device>
<device DEVNO="0x0846" TIME="1311441536" LABEL="DATA06" TYPE="oracleasm">/dev/sde6</device>
<device DEVNO="0x0848" TIME="1311441536" LABEL="" TYPE="oracleasm">/dev/sde8</device>
<device DEVNO="0x0851" TIME="1311441536" LABEL="" TYPE="oracleasm">/dev/sdf1</device>
<device DEVNO="0x0852" TIME="1311441536" LABEL="" TYPE="oracleasm">/dev/sdf2</device>
<device DEVNO="0x0853" TIME="1311441536" LABEL="" TYPE="oracleasm">/dev/sdf3</device>
<device DEVNO="0x0855" TIME="1311441536" LABEL="" TYPE="oracleasm">/dev/sdf5</device>
<device DEVNO="0x0858" TIME="1311441536" LABEL="" TYPE="oracleasm">/dev/sdf8</device>
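To pick the ASM entries with an empty label out of a listing like the one above, a filter along the following lines may help. It is a rough sketch, not a real parser: it assumes one <device> element per line with the attributes written as LABEL="" and TYPE="oracleasm", and the function name unlabeled_asm_devices is made up for this example.

```shell
#!/bin/sh
# unlabeled_asm_devices
# Reads blkid.tab-style lines on stdin and prints the device path of every
# entry that has TYPE="oracleasm" and an empty LABEL.
# (Hypothetical helper; assumes one <device ...>path</device> element per line.)
unlabeled_asm_devices() {
    grep 'TYPE="oracleasm"' \
        | grep 'LABEL=""' \
        | sed -e 's/^[^>]*>//' -e 's/<\/device>.*$//'
}

# Example use on a real host (not executed here):
#   unlabeled_asm_devices < /etc/blkid.tab
```

Entries whose label survived, such as the DATA06 lines, are filtered out, so the remaining paths are the candidates whose labels were lost.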

The output offers a clue: where the label is empty, the ASM disk has been lost. Following this idea, I edited the three corresponding lines in the file by hand to set LABEL="DATA08". DATA08 had been created with oracleasm createdisk and held in reserve; it holds no data, so damaging it would not matter. I then ran oracleasm scandisks, but the output of oracleasm listdisks did not change, so this trick does not work. As a last attempt I restarted the system to see whether that would help. (I learned later that the content of /etc/blkid.tab is generated automatically when blkid runs and reads from the system directory /dev.)

Would I have to tear everything down and rebuild? I was unwilling to. I was not afraid of losing data, but worried that the problem would happen again next time. Before starting over, I tried creating an ASM disk, again using the unused partition /dev/mapper/mpath14p8, since breaking it would have no effect. Running oracleasm createdisk DATA08 /dev/mapper/mpath14p8 gave this result:

Device "/dev/mapper/mpath14p8" is already labeled for ASM disk ""

This output was a useful hint: the ASM disk label exists, but its value is empty (it should originally have been DATA08). So could I force it from empty back to its original value? I did not know how to change an Oracle ASM disk label, so I opened the file /etc/init.d/oracleasm to look for a way and found the following function:

force_relabel_disk()
{
    OLD="$1"
    NEW="$2"

    echo -n "Renaming disk \"${OLD}\" to \"${NEW}\": "
    "${ORACLEASM}" renamedisk -f -v -l "${ORACLE_ASMMANAGER}" "${OLD}" "$2" 1>/var/log/oracleasm 2>&1
    if_fail "$?" "Unable to rename disk \"${OLD}\" see /var/log/oracleasm"
}

The line that calls renamedisk -f shows the syntax for forcibly changing an ASM disk label. I could not wait and immediately ran oracleasm renamedisk -f /dev/mapper/mpath14p8 DATA08. It went through smoothly. I switched to the /dev/oracleasm/disks directory and the device file DATA08 had appeared, much to my delight! I then ran oracleasm scandisks on the other host, followed by oracleasm listdisks. From this it was safe to predict that forcibly renaming each ASM disk back to its previous label should restore it. Rather than rushing to restore every label, I started with the disks that hold the OCR. In this setup the OCR uses two ASM disks named OCR1 and OCR2 (thanks to the screen recording made when RAC was installed). I ran the following two commands:

oracleasm renamedisk -f /dev/mapper/mpath14p1 OCR1
oracleasm renamedisk -f /dev/mapper/mpath15p1 OCR2
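When many disks need the same treatment, it can be safer to generate the commands from a written-down mapping and review them before running anything. The sketch below only prints the commands; it never executes oracleasm. The input format (one "device label" pair per line) and the function name print_relabel_commands are assumptions made for this example, and the mapping itself must come from your own installation records.

```shell
#!/bin/sh
# print_relabel_commands
# Reads "device label" pairs on stdin, one per line, and prints the matching
# "oracleasm renamedisk -f" command for each pair. Blank lines and lines
# starting with '#' are skipped. Nothing is executed (dry run only).
# (Hypothetical helper; the input format is an assumption.)
print_relabel_commands() {
    while read -r device label; do
        case "$device" in
            ''|'#'*) continue ;;
        esac
        if [ -z "$label" ]; then
            echo "skipping $device: no label given" >&2
            continue
        fi
        printf 'oracleasm renamedisk -f %s %s\n' "$device" "$label"
    done
}

# Example use (mapping.txt is a hypothetical file name):
#   print_relabel_commands < mapping.txt
```

Printing rather than executing keeps a forced relabel, which overwrites the disk header label, a deliberate manual step.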

After they succeeded, I ran oracleasm scandisks and checked the directory /dev/oracleasm/disks: the files OCR1 and OCR2 both existed. The output of oracleasm listdisks also showed OCR1 and OCR2. With the OCR disks indeed restored, I tried to start CRS by running crsctl start crs as root. After I pressed Enter the terminal went silent, so I took a trip to the toilet. When I came back, the command had finished and returned to the shell prompt. Running ps auxww | grep -i asm to check the processes showed that the ASM instance was indeed up. I switched to the grid user, ran asmcmd, and entered interactive mode without trouble. The output of ASMCMD> ls was:

ASMCMD> ls
DGCRS/
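The process check used in the step above (ps auxww | grep -i asm) also matches the grep command itself and anything else with "asm" in its name. A slightly stricter check, sketched below, looks for the ASM PMON background process instead. It assumes the standard naming asm_pmon_+ASM<n> for that process; the function name asm_instance_running is hypothetical.

```shell
#!/bin/sh
# asm_instance_running
# Reads `ps` output on stdin and returns success (exit status 0) if a line
# mentions an ASM PMON background process such as asm_pmon_+ASM1.
# -F makes grep treat the pattern as a fixed string, so '+' is literal.
# (Hypothetical helper; assumes the usual asm_pmon_+ASM<n> process name.)
asm_instance_running() {
    grep -qF 'asm_pmon_+ASM'
}

# Example use on a real host (not executed here):
#   if ps auxww | asm_instance_running; then echo "ASM instance is up"; fi
```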

Note: DGCRS is the disk group made up of OCR1 and OCR2. I started CRS on the other server as well, and its ASM instance ran normally. Now the remaining ASM labels could be forced back with peace of mind. When that was done, the directory listing was:

[root@rac1 dev]# ll /dev/oracleasm/disks
total 0
brw-rw---- 1 grid oinstall   8, 18 Jul 26 17:05 DATA02
brw-rw---- 1 grid oinstall   8, 19 Jul 26 DATA03
brw-rw---- 1 grid oinstall   8, 21 Jul 26 DATA05
brw-rw---- 1 grid oinstall   8, 22 Jul 24 DATA06
brw-rw---- 1 grid oinstall   8, 24 Jul 26 DATA08
brw-rw---- 1 grid oinstall   8, 34 Jul 26 DATA12
brw-rw---- 1 grid oinstall   8, 35 Jul 26 DATA13
brw-rw---- 1 grid oinstall   8, 37 Jul 26 DATA15
brw-rw---- 1 grid oinstall 253, 14 Jul 24 DATA16
brw-rw---- 1 grid oinstall   8, 40 Jul 26 DATA18
brw-rw---- 1 grid oinstall   8, 17 Jul 26 OCR1
brw-rw---- 1 grid oinstall   8, 33 Jul 24 OCR2
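The major and minor numbers in this listing can be matched against the DEVNO values in /etc/blkid.tab to see which underlying device each ASM disk file refers to. The sketch below decodes a DEVNO value, assuming the simple 16-bit layout that all the values in this article fit (high byte is the major number, low byte is the minor number); the function name devno_to_major_minor is hypothetical.

```shell
#!/bin/sh
# devno_to_major_minor DEVNO
# Converts a hexadecimal DEVNO value as written in blkid.tab (for example
# 0x0816) into "major,minor". Assumes the classic 16-bit layout: the high
# byte is the major number and the low byte is the minor number.
# (Hypothetical helper for illustration only.)
devno_to_major_minor() {
    devno="$1"
    major=$(( ($devno >> 8) & 0xff ))
    minor=$(( $devno & 0xff ))
    echo "${major},${minor}"
}

# Example use:
#   devno_to_major_minor 0x0816
```

For instance, DEVNO 0x0816 (listed for /dev/sdb6 with LABEL="DATA06") decodes to 8,22, which is the device number shown for DATA06 in the listing.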

After confirming that everything was correct, I contacted the relevant people to let them know the database was about to be started, and checked the ORACLE_SID and the ASM disk labels once more. I took a deep breath, typed /u01/app/grid/bin/srvctl start database -d DB4QIGOU, pressed Enter, and got up to get some fresh orange juice (which probably contains plasticizer). When I figured the database was nearly up, I came back and checked: all Oracle instances were working properly. One oddity was that the servers had swapped instances (rac1 was running db4qigou_2 and rac2 was running db4qigou_1). This is harmless. I shut the instances down, then ran srvctl start instance -d DB4QIGOU -i DB4QIGOU_1 -n rac1 on rac1 and srvctl start instance -d DB4QIGOU -i DB4QIGOU_2 -n rac2 on rac2.

Addendum: Oracle's official site recommends modifying the scan order in the /etc/sysconfig/oracleasm file. In this case, however, changing the scan order had no effect.

This article is from the sery blog, please be sure to keep this source http://sery.blog.51cto.com/10037/624008
