Oracle Grid Infrastructure: Understanding Split-Brain Node Eviction (Doc ID 1546004.1)

In this Document

Purpose
Scope
Details
What does "split brain" mean?
Why is this a problem?
How does the Clusterware resolve a "split brain" situation?
Identifying a Split-brain Eviction
Finding the cohort
Understanding the cohort message
Using the cohort message to identify interconnect network issues
Follow-up Action

References

Applies To:

Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.

PURPOSE

The purpose of this note is to explain split-brain node evictions in Oracle Clusterware release 11.2.

SCOPE

The intended audience of this note is Oracle Clusterware 11.2 administrators at any level of expertise. As written, this note applies only to 11.2.

DETAILS

Missed Network Heartbeat (NHB) evictions happen when OCSSD on the surviving node loses contact with the evicted node over the interconnect. The nodes must be able to communicate over the interconnect to avoid a "split brain" situation. In the case of a "split brain" node eviction, one node aborts itself to avoid "split brain" when communication over the interconnect is compromised.

What does "split brain" mean?

"Split brain" means that there are 2 or more distinct sets of nodes, or "cohorts", with no communication between the cohorts.

For example:
Suppose there are 4 nodes named A, B, C, D, in the following situation:
* Nodes A and B can talk to each other; nodes C and D can talk to each other
* But A and B cannot talk to C or D, and vice versa
Then there are two cohorts: {A, B} and {C, D}.
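
The 4-node example can be sketched in a few lines of Python. This is only an illustration of the idea, not how Oracle Clusterware computes cohorts internally: it assumes that a cohort can be modelled as a connected group of nodes, and the function name find_cohorts and the link list are hypothetical.

```python
from collections import defaultdict


def find_cohorts(nodes, links):
    """Group nodes into cohorts: sets of nodes that can reach each other.

    `links` lists the node pairs that can still talk over the interconnect.
    Modelling a cohort as a connected group is a simplifying assumption.
    """
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    cohorts, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        # Walk outwards from `start`, collecting every reachable node.
        cohort, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in cohort:
                continue
            cohort.add(node)
            stack.extend(neighbours[node] - cohort)
        seen |= cohort
        cohorts.append(cohort)
    return cohorts


# A and B can talk, C and D can talk, nothing crosses between the pairs.
cohorts = find_cohorts(["A", "B", "C", "D"], [("A", "B"), ("C", "D")])
print(len(cohorts) > 1)  # more than one cohort means a split brain
```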

Why is this a problem?

In a split-brain situation, there are in a sense two (or more) separate clusters working on the same shared storage. This has the potential for data corruption, so the split-brain must be resolved.

How does the Clusterware resolve a "split brain" situation?

Oracle Clusterware handles the split-brain by terminating all the nodes in the SMALLER cohort.
If both of the cohorts are the same size, the cohort with the lowest numbered node in it survives.

The Clusterware identifies the largest cohort, and aborts all the nodes that do not belong to that cohort.
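
The selection rule described above can be written down as a small sketch. It assumes cohorts are given as non-empty sets of integer node numbers; the function names are hypothetical and the code only mirrors the rule as stated in this note, not the Clusterware implementation.

```python
def surviving_cohort(cohorts):
    """Pick the survivor: the largest cohort, ties broken by lowest node number."""
    # A larger size wins; for equal sizes, a smaller minimum node number wins,
    # which is why the minimum is negated inside the sort key.
    return max(cohorts, key=lambda cohort: (len(cohort), -min(cohort)))


def evicted_nodes(cohorts):
    """Return the sorted node numbers outside the surviving cohort."""
    survivor = surviving_cohort(cohorts)
    return sorted(n for cohort in cohorts if cohort is not survivor for n in cohort)


print(evicted_nodes([{1}, {2, 3, 4}]))   # [1]
print(evicted_nodes([{1, 4}, {2, 3}]))   # [2, 3]
```

In the second call both cohorts have two nodes, so the cohort containing node 1 survives.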

Identifying a Split-brain Eviction

In a split-brain node eviction, the following message is present in the OCSSD log ($GRID_HOME/log/

clssnmCheckDskInfo: Aborting local node to avoid splitbrain.

And earlier in the same log, within minutes prior to the "clssnmCheckDskInfo: Aborting local node" message:

clssnmPollingThread: node %s (%n) at <x>% heartbeat fatal, removal in ...
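
As a rough aid, the two messages can be searched for with a short script. This is a minimal sketch: the patterns are built only from the message text quoted in this note, they are deliberately case-insensitive and whitespace-tolerant, and the function name is hypothetical. It checks only the order of the messages, not the "within minutes" time window.

```python
import re

# Patterns derived from the message text quoted in this note. The exact
# spacing and capitalisation in a real log are assumptions, so matching
# ignores case and allows optional whitespace after the colon.
ABORT_RE = re.compile(
    r"clssnmCheckDskInfo:\s*aborting local node to avoid splitbrain",
    re.IGNORECASE,
)
FATAL_RE = re.compile(
    r"clssnmPollingThread:\s*node.*heartbeat fatal",
    re.IGNORECASE,
)


def looks_like_split_brain(lines):
    """Return True if an abort message follows a heartbeat-fatal message."""
    seen_fatal = False
    for line in lines:
        if FATAL_RE.search(line):
            seen_fatal = True
        elif ABORT_RE.search(line) and seen_fatal:
            return True
    return False


# The message templates from this note, used here in place of real log lines.
sample = [
    "clssnmPollingThread: node%s (%n) at <x>% heartbeat fatal, removal in ...",
    "clssnmCheckDskInfo: Aborting local node to avoid splitbrain.",
]
print(looks_like_split_brain(sample))  # True
```

To scan a real log, pass an open file object instead of the sample list, since the function only iterates over lines.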

Finding the cohort

The split-brain message in the ocssd.log shows "cohort" information. For example:

2012-12-28 20:26:25.803: [CSSD][1111296320]clssnmCheckDskInfo: My cohort: 1
2012-12-28 20:26:25.803: [CSSD][1111296320]clssnmCheckDskInfo: Surviving cohort: 2,3,4
2012-12-28 20:26:25.803: [CSSD][1111296320](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 1, SPRORA01, was smaller than Cohort of 3 nodes led by Node 2, SPRORA02, based on map type 2

Understanding the cohort message

In a split-brain situation, OCSSD on each node records on the voting disk the set of nodes it can communicate with. Each set is known as a "cohort". When there are mutually non-intersecting sets, we have a "split-brain" situation. It means that there are two (or more) separate sets of nodes that cannot talk to each other over the interconnect.

For example, in the log excerpt above:

My cohort: 1
Surviving cohort: 2,3,4

The meaning of these messages is:

* "My cohort: 1" = The list of nodes I can communicate with: 1
* "Surviving cohort: 2,3,4" = From the voting disk, I know that nodes 2,3,4 can all communicate with each other.
* "Cohort of 1 nodes with leader 1, SPRORA01, was smaller than Cohort of 3 nodes led by Node 2, SPRORA02" = Oracle Clusterware has identified that the cohort {1} is smaller than the cohort {2,3,4}.

Oracle Clusterware handles the split-brain by terminating all the nodes in the SMALLER cohort. The smaller cohort is {1}. Therefore, OCSSD on node {1} aborts the node.
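
The cohort lines can also be pulled out of the log programmatically. The sketch below assumes the line layout of the example excerpt in this note; the pattern and the function name parse_cohorts are hypothetical, and matching is case-insensitive and whitespace-tolerant because the exact formatting may differ between logs.

```python
import re

# Matches the "My cohort" and "Surviving cohort" lines as quoted in this note.
COHORT_RE = re.compile(
    r"clssnmCheckDskInfo:\s*(my|surviving)\s+cohort:\s*([\d,\s]+)",
    re.IGNORECASE,
)


def parse_cohorts(lines):
    """Return a dict such as {'my': {...}, 'surviving': {...}} of node numbers."""
    cohorts = {}
    for line in lines:
        match = COHORT_RE.search(line)
        if match:
            kind = match.group(1).lower()
            members = match.group(2).split(",")
            cohorts[kind] = {int(n) for n in members if n.strip()}
    return cohorts


lines = [
    "2012-12-28 20:26:25.803: [CSSD][1111296320]clssnmCheckDskInfo: My cohort: 1",
    "2012-12-28 20:26:25.803: [CSSD][1111296320]clssnmCheckDskInfo: Surviving cohort: 2,3,4",
]
cohorts = parse_cohorts(lines)
# The local node aborts when its own cohort is the smaller one.
print(len(cohorts["my"]) < len(cohorts["surviving"]))  # True
```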

Using the cohort message to identify interconnect network issues

The cohort message describes which nodes can communicate with each other.

Each cohort is a set of nodes that can talk to each other, and cannot talk to the nodes not in the cohort.

In the above example, the cohort message tells us that nodes {2,3,4} are all in communication with each other; node 1 is not in communication with any of them.
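
Read this way, the cohort message gives a list of node pairs whose interconnect path is suspect: every pair that spans the two cohorts. A minimal sketch, with a hypothetical function name:

```python
from itertools import product


def suspect_links(my_cohort, surviving_cohort):
    """List the node pairs that could not communicate across the two cohorts."""
    return sorted(product(sorted(my_cohort), sorted(surviving_cohort)))


print(suspect_links({1}, {2, 3, 4}))  # [(1, 2), (1, 3), (1, 4)]
```

Here every suspect pair involves node 1, which points at node 1's side of the private network as the first thing to examine.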

Follow-up Action

The private network between node 1 and the other 3 nodes should be checked.

Refer to the following note to check the private interconnect network: Document 1534949.1 - Oracle Grid Infrastructure: How to Troubleshoot Missed Network Heartbeat Evictions

REFERENCES

Note 1534949.1 - Oracle Grid Infrastructure: How to Troubleshoot Missed Network Heartbeat Evictions
