Oracle RAC 11gR2 Grid startup sequence and startup fault diagnosis

Source: Internet
Author: User

Starting with 11gR2, the Oracle RAC architecture changed considerably, and the cluster layer in particular differs a lot from earlier versions. The original RAC stack was essentially three bare processes (CSSD, CRSD, EVMD) with few logs, so when RAC failed to start, even the crude approach of reading each process's log one at a time would find the cause. From 11gR2 onward the cluster layer is much larger. The following is the directory listing under $GRID_HOME/log/rac1/:

[grid@rac1 rac1]$ ls
acfs     acfsrepl      acfssec  agent          client   crfmond  cssd   cvu      evmd   gnsd   mdnsd
acfslog  acfsreplroot  admin    alertrac1.log  crflogd  crsd     ctssd  diskmon  gipcd  gpnpd  ohasd  srvm

There are so many directories here that, if RAC does not start, searching every log for the reason is extremely inefficient. We need a more targeted diagnostic approach.

Let's get down to business; I hope this helps with your daily diagnosis.

Step 1: before diagnosing why the grid does not start, we need to understand the grid startup process in 11gR2. The following diagram illustrates the startup order:

As the diagram shows, 11gR2 has changed a lot compared to the original Oracle 10g cluster architecture. The role of each process is not repeated here; readers unfamiliar with them can look them up. This article covers only the startup sequence. When the cluster starts, the OHASD process starts first, and OHASD then starts 4 agents:

1. cssdagent

Started as root; responsible for starting the CSSD process.

2. orarootagent

Started as root; responsible for starting the following daemons: CRSD, CTSSD, DiskMon and ACFS. These processes also run with root privileges.

3. oraagent

Started as the grid user; responsible for starting the MDNSD, GIPCD, GPNPD and EVMD processes and ASM (from 11gR2 on, ASM sits at a lower level of the cluster stack, which differs significantly from previous versions).

4. cssdmonitor

Started as root; responsible for running the cssdmonitor process.

As the diagram also shows, the CRSD process in turn starts two agents of its own, orarootagent and oraagent (which is why the final process list contains two oraagent processes: the one started by OHASD plus this one). These two agents then start the user resources. Once startup reaches this point, I consider the lower layer of the grid to be up; problems with the resources started by CRSD's orarootagent and oraagent are not covered in this article.

Step 2: now that the startup sequence is clear, diagnosing a grid that will not start is straightforward. Running ps -ef | grep /oracle/app/grid/product/11.2.0 (that is, $GRID_HOME) shows how far startup has progressed: which processes have started, which have not, and which are stuck. That tells us quickly which log to read. For example, if the CRSD process does not start, look at crsd.log in the $GRID_HOME/log/rac1/crsd directory to see which errors during CRSD startup prevented it from starting properly.

Example:

[grid@rac1 crsd]$ ps -ef|grep /oracle
root     15235     1  0 14:12 ?        00:00:06 /oracle/app/grid/product/11.2.0/bin/ohasd.bin reboot
grid     15356     1  0 14:12 ?        00:00:00 /oracle/app/grid/product/11.2.0/bin/oraagent.bin
grid     15367     1  0 14:12 ?        00:00:00 /oracle/app/grid/product/11.2.0/bin/mdnsd.bin
grid     15378     1  0 14:12 ?        00:00:02 /oracle/app/grid/product/11.2.0/bin/gpnpd.bin
grid     15388     1  2 14:12 ?        00:00:19 /oracle/app/grid/product/11.2.0/bin/gipcd.bin
root     15390     1  0 14:12 ?        00:00:00 /oracle/app/grid/product/11.2.0/bin/orarootagent.bin
root     15403     1  0 14:12 ?        00:00:08 /oracle/app/grid/product/11.2.0/bin/osysmond.bin
root     15477     1  0 14:12 ?        00:00:02 /oracle/app/grid/product/11.2.0/bin/ologgerd -M -d /oracle/app/grid/product/11.2.0/crf/db/rac1
root     15637     1  0 14:22 ?        00:00:00 /oracle/app/grid/product/11.2.0/bin/cssdmonitor
root     15665     1  0 14:22 ?        00:00:00 /oracle/app/grid/product/11.2.0/bin/cssdagent
grid     15676     1  0 14:22 ?        00:00:00 /oracle/app/grid/product/11.2.0/bin/ocssd.bin
grid     15730 13826  0 14:27 pts/1    00:00:00 grep /oracle

From the above output, startup has stalled at CSSD: ocssd.bin is the last daemon in the list and CRSD has not appeared. So we go straight to ocssd.log to see why CSSD cannot finish starting, and find the following in the log:

2016-05-09 14:30:26.476: [    CSSD][1104030016]clssnmvDHBValidateNCopy: node 2, rac2, has a disk HB, but no network HB, DHB has rcfg 358258450, wrtcnt, 177436, LATS 10923264, lastSeqNo 177435, uniqueness 1462763679, timestamp 1462775426/10874194

The message shows that the problem is on the private network: node 2 has a disk heartbeat (HB) but no network heartbeat. After the private network is repaired, the cluster starts normally.
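The "which daemon is missing" check from step 2 can be sketched as a small shell function. This is only an illustration: the function name missing_daemons and the EXPECTED_DAEMONS list are assumptions made for this sketch (the list follows the process names discussed in this article and leaves out ASM), so adjust it to your own version.

```shell
#!/bin/sh
# Lower-stack daemons we expect to see, roughly in startup order.
# Assumption for illustration only; adjust to your Grid version.
EXPECTED_DAEMONS="ohasd.bin oraagent.bin mdnsd.bin gpnpd.bin gipcd.bin orarootagent.bin cssdmonitor cssdagent ocssd.bin octssd.bin evmd.bin crsd.bin"

# Reads `ps -ef` output on stdin and prints every expected daemon
# that does not appear in it, one name per line.
missing_daemons() {
    ps_output=$(cat)
    for daemon in $EXPECTED_DAEMONS; do
        # The binary name must follow a "/" and end at a space or end of line,
        # so "oraagent.bin" does not match "orarootagent.bin".
        if ! printf '%s\n' "$ps_output" | grep -Eq "/${daemon}( |\$)"; then
            printf '%s\n' "$daemon"
        fi
    done
}
```

On a cluster node it would be used as ps -ef | missing_daemons. A daemon that is listed by ps can still be stuck, as ocssd.bin is in the example above, so the first missing name only indicates where to start reading logs.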

Step 3: MOS article ID 1623340.1 lists the common causes of grid process startup failures and the corresponding logs, as summarized below.

1.1.1. Cluster status


Query the status of the cluster and its daemons:

$GRID_HOME/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

$GRID_HOME/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       rac1                     Started
ora.crsd
      1        ONLINE  ONLINE       rac1
ora.cssd
      1        ONLINE  ONLINE       rac1
ora.cssdmonitor
      1        ONLINE  ONLINE       rac1
ora.ctssd
      1        ONLINE  ONLINE       rac1                     OBSERVER
ora.diskmon
      1        ONLINE  ONLINE       rac1
ora.drivers.acfs
      1        ONLINE  ONLINE       rac1
ora.evmd
      1        ONLINE  ONLINE       rac1
ora.gipcd
      1        ONLINE  ONLINE       rac1
ora.gpnpd
      1        ONLINE  ONLINE       rac1
ora.mdnsd
      1        ONLINE  ONLINE       rac1

For 11.2.0.2 and above, the following two additional resources are present:

ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       rac1
ora.crf
      1        ONLINE  ONLINE       rac1

For non-Exadata systems on 11.2.0.3 and above, ora.diskmon will be in the OFFLINE state, as follows:

ora.diskmon
      1        OFFLINE OFFLINE      rac1

For 12c and above, the ora.storage resource also appears:

ora.storage
      1        ONLINE  ONLINE       racnode1                 STABLE



If a daemon is offline, it can be started with the following command:

$GRID_HOME/bin/crsctl start res ora.crsd -init
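To avoid scanning the resource table by eye, the output can be filtered for resources that should be up but are not. The following is a minimal sketch; the function name offline_resources is made up for this example, and it assumes the two-line layout shown above (a line with the resource name, then a line with instance number, TARGET and STATE).

```shell
#!/bin/sh
# Reads `crsctl stat res -t -init` output on stdin and prints the name of
# every resource whose TARGET is ONLINE but whose STATE is not ONLINE.
# Resources with TARGET OFFLINE are skipped because they are meant to be down.
offline_resources() {
    awk '
        /^ora\./ { name = $1; next }      # resource name line
        name != "" && $1 ~ /^[0-9]+$/ {   # instance line: number, TARGET, STATE, ...
            if ($2 == "ONLINE" && $3 != "ONLINE") print name
            name = ""
        }
    '
}
```

It would be called as $GRID_HOME/bin/crsctl stat res -t -init | offline_resources, and each name it prints is a candidate for crsctl start res <name> -init and for a look at the matching log directory.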

1.1.2. Problem 1: OHASD cannot be started


Because ohasd.bin is responsible for starting all other processes, directly or indirectly, the other processes can come up only if it starts normally. If ohasd.bin is not up, checking the resource status returns error CRS-4639 (Could not contact Oracle High Availability Services). If ohasd.bin is already running, trying to start it again returns error CRS-4640. If ohasd.bin fails to start, the following errors appear:

CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.



Automatic startup of ohasd.bin depends on the following configuration:

1. The operating system is at the correct run level:

The OS must reach the specified run level before CRS is started, to ensure that CRS starts normally.

The run level that CRS requires can be found as follows:

cat /etc/inittab|grep init.ohasd
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null



The above example shows that CRS requires the OS to be at run level 3 or 5. Note that the run level required for CRS startup differs between operating systems.

To find the run level the OS is currently at:

who -r
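The two checks above can be combined: take the run levels from the init.ohasd entry and compare them with the current run level. The helper names ohasd_runlevels and runlevel_allowed are invented for this sketch, which assumes an inittab-style entry whose second colon-separated field lists the run levels.

```shell
#!/bin/sh
# Reads inittab-style text on stdin and prints the run levels field
# (second colon-separated field) of the first uncommented init.ohasd entry.
ohasd_runlevels() {
    awk -F: '/init\.ohasd/ && $1 !~ /^#/ { print $2; exit }'
}

# Succeeds if run level $1 (e.g. "3") is one of the levels in $2 (e.g. "35").
runlevel_allowed() {
    [ -n "$1" ] || return 1
    case "$2" in
        *"$1"*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

On Linux, where who -r prints the run level as its second field, a check could look like runlevel_allowed "$(who -r | awk '{ print $2 }')" "$(ohasd_runlevels < /etc/inittab)". The field position in the who -r output differs on other platforms, so verify it before relying on this.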



2. "init.ohasd run" is started

On Linux/UNIX platforms, because "init.ohasd run" is configured in /etc/inittab, the init process (process ID 1; /sbin/init on Linux, Solaris and HP-UX, /usr/sbin/init on AIX) spawns the "init.ohasd run" process. If this step fails, there is no "init.ohasd run" process running, and ohasd.bin will not start:

ps -ef|grep init.ohasd|grep -v grep
root 2279 1 0 18:14 ? 00:00:00 /bin/sh /etc/init.d/init.ohasd run

Note: Oracle Linux 6 (OL6) and Red Hat Enterprise Linux 6 (RHEL6) no longer support inittab, so init.ohasd is configured in /etc/init and started from there. We should nevertheless still see the process "/etc/init.d/init.ohasd run" running.

If any rc Snncommand script (located in rcn.d, for example S98gcstartup) hangs during startup, the init process may never start "/etc/init.d/init.ohasd run". Ask the OS vendor to find out why the Snncommand script hangs or fails to start properly.

The error "[ohasd(<pid>)] CRS-0715:Oracle High Availability Service has timed out waiting for init.ohasd to be started." may occur if init.ohasd cannot be started within the specified time.

If the system administrator cannot quickly find out why init.ohasd does not start, the following can be used as a temporary workaround:

cd <location-of-init.ohasd>
nohup ./init.ohasd run &




3. Clusterware autostart is enabled (it is enabled by default)

By default, CRS autostart is turned on. It can be enabled as follows:

$GRID_HOME/bin/crsctl enable crs


To check whether autostart is enabled:

$GRID_HOME/bin/crsctl config crs


If the following messages appear in the OS log:

Feb 16:20:36 racnode1 logger: Oracle Cluster Ready Services startup disabled.
Feb 16:20:36 racnode1 logger: Could not access /var/opt/oracle/scls_scr/racnode1/root/ohasdstr


then the file does not exist or is not accessible. This is usually caused by someone modifying it manually, or by using the wrong opatch when applying a GI patch (for example, using the Solaris opatch to patch a Linux installation).


4. syslogd is up and the OS can execute the init script S96ohasd

After the node starts, the OS may hang on some other Snn script and never get the chance to execute S96ohasd. If that happens, the following message will not appear in the OS log

(AIX: /var/adm/syslog, Linux: /var/log/messages)

20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.


If the message above is missing from the OS log, another possibility is that syslogd (/usr/sbin/syslogd) was not fully up. The grid does not start properly in that case either. This does not apply to the AIX platform.

To find out whether the S96ohasd script is executed after OS startup, modify the script as follows:

From:

case `$CAT $AUTOSTARTFILE` in
  enable*)
    $LOGERR "Oracle HA daemon is enabled for autostart."


To:

case `$CAT $AUTOSTARTFILE` in
  enable*)
    /bin/touch /tmp/ohasd.start."`date`"
    $LOGERR "Oracle HA daemon is enabled for autostart."


After restarting the node, if the file /tmp/ohasd.start.<timestamp> has not been created, the OS is stuck on some other Snn script. If /tmp/ohasd.start.<timestamp> exists but "Oracle HA daemon is enabled for autostart" has not been written to the messages file, syslogd was not fully up. In both cases, ask your system administrator to find the cause at the OS level. For the latter case, a temporary workaround is to sleep for 2 minutes, by modifying the ohasd script as follows:

From:

case `$CAT $AUTOSTARTFILE` in
  enable*)
    $LOGERR "Oracle HA daemon is enabled for autostart."


To:

case `$CAT $AUTOSTARTFILE` in
  enable*)
    /bin/sleep 120
    $LOGERR "Oracle HA daemon is enabled for autostart."
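The decision described above (marker file created or not, message logged or not) can be written down as a short function. This is a hedged sketch: diagnose_ohasd_autostart is a name made up for this example, and both the marker prefix and the log path are passed in as arguments rather than assumed.

```shell
#!/bin/sh
# Classifies the outcome of the touch-marker experiment described above.
#   $1: marker path prefix used in the modified script, e.g. /tmp/ohasd.start
#   $2: OS log file to search, e.g. /var/log/messages on Linux
diagnose_ohasd_autostart() {
    marker_prefix=$1
    os_log=$2
    # Expand <prefix>.* ; if nothing matches, $1 stays the literal pattern.
    set -- "$marker_prefix".*
    if [ ! -e "$1" ]; then
        echo "script not reached"
    elif ! grep -q "Oracle HA daemon is enabled for autostart" "$os_log" 2>/dev/null; then
        echo "script ran, message not logged"
    else
        echo "script ran, message logged"
    fi
}
```

For example, diagnose_ohasd_autostart /tmp/ohasd.start /var/log/messages. "script not reached" corresponds to the OS being stuck on another Snn script, and "script ran, message not logged" to syslogd not being fully up. The log file may still hold the message from an earlier boot, so compare timestamps before trusting the last result.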


