Understanding the Split-Brain Resolution Protocol of the Oracle RAC Cluster


How CSS works

Before looking at the split-brain handling process, it is worth introducing the working framework of Oracle RAC CSS (Cluster Synchronization Services):

Oracle RAC CSS provides two background services: Group Management (GM) and Node Monitor (NM). GM provides the group and lock services. At any given time, one node in the cluster acts as the GM master node; the other nodes send their GM requests to the master serially, and the master broadcasts cluster membership changes to the rest of the cluster. Group membership is synchronized on every cluster reconfiguration, and each node independently interprets the membership changes.

The Node Monitor (NM) service keeps node information consistent with other vendors' cluster software through skgxn (libskgxn.a), the library that provides node monitoring. In addition, NM maintains the well-known network heartbeat and disk heartbeat to confirm that each node is still alive. When a cluster member fails its network heartbeat or disk heartbeat, NM is responsible for evicting that member from the cluster, and the evicted node is then restarted (reboot).

The NM service uses the interconnect information recorded in OCR to determine the endpoints for listening and interaction. It sends heartbeat messages over the network to the other cluster members, and at the same time monitors the network heartbeats arriving from all of them; a network heartbeat occurs every second. If no network heartbeat is received from a node within misscount seconds, that node is considered dead. NM is also responsible for initiating cluster reconfiguration when nodes join or leave the cluster. (On the misscount and disktimeout defaults: in 10.2.0.1, misscount defaults to 60 s on Linux and 30 s on other platforms, or 600 s when third-party vendor clusterware is used; disktimeout did not yet exist in 10.2.0.1. From 10.2.0.4, disktimeout is 200 s. From 11.2, misscount is 30 s: CRS-4678: Successful get misscount 30 for Cluster Synchronization Services; CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services.)
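The misscount check described above can be sketched as follows. This is a simplified illustration, not Oracle's actual implementation: the class name, the injectable clock, and the per-peer bookkeeping are all assumptions; only the rule itself (no heartbeat within misscount seconds means the peer is dead) comes from the text.

```python
import time

MISSCOUNT = 30  # seconds; the 11.2 default mentioned above

class NetworkHeartbeatMonitor:
    """Tracks the last network heartbeat received from each peer node."""

    def __init__(self, peers, clock=time.monotonic):
        self.clock = clock
        # Treat startup as the last-seen time for every peer.
        self.last_seen = {peer: clock() for peer in peers}

    def on_heartbeat(self, peer):
        # Called when the once-per-second heartbeat arrives from a peer.
        self.last_seen[peer] = self.clock()

    def dead_nodes(self):
        # A peer whose heartbeat has not arrived within misscount
        # seconds is considered dead; NM would then initiate a
        # cluster reconfiguration.
        now = self.clock()
        return [p for p, t in self.last_seen.items()
                if now - t > MISSCOUNT]
```

Injecting the clock keeps the sketch testable without real waiting; the real CSS daemon obviously works against wall-clock time.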

In split-brain scenarios, NM also monitors the voting disk to learn about competing sub-clusters. Sub-clusters deserve a brief introduction. Imagine an environment with a large number of nodes, say the 128-node maximum that Oracle officially supports. When a network failure occurs, there are several possibilities. One is a global network failure: none of the 128 nodes can exchange network heartbeats, producing up to 128 single-node "island" sub-clusters. Another is a partial network failure: the 128 nodes split into several parts, each containing more than one node; these parts are called sub-clusters. After the failure, the nodes within a sub-cluster can still communicate with each other and exchange voting messages (vote mesg), but the sub-clusters and island nodes cannot reach one another over the regular interconnect. In that case, the voting disk is needed for NM reconfiguration.
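The decision of which sub-cluster survives such a partition is commonly described (for pre-12.2 CSS) as: the sub-cluster with the most nodes wins, and a tie is broken in favor of the sub-cluster containing the lowest node number. A minimal sketch of that rule, with the node numbers purely illustrative:

```python
def surviving_subcluster(subclusters):
    """Pick the sub-cluster that survives a split-brain.

    Rule (as commonly documented for pre-12.2 CSS; an assumption
    here, not stated in the article): the sub-cluster with the most
    nodes wins, and on a tie the one containing the lowest node
    number wins. `subclusters` is a list of lists of node numbers.
    """
    return max(subclusters, key=lambda sc: (len(sc), -min(sc)))

# A 4-node cluster splits 2/2; the half containing node 1 survives.
print(surviving_subcluster([[3, 4], [1, 2]]))  # [1, 2]
```

All losing sub-clusters would then be evicted and their nodes rebooted, which is what the voting disk arbitration described above makes possible.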

Voting Disk

Because NM uses the voting disk to work around communication barriers caused by network faults, the voting disk must be accessible at all times. Under normal conditions, each node performs a disk heartbeat: every second it writes its heartbeat information to its own block on the voting disk. Also every second, CSS reads a special block called the "kill block"; when the kill block's content indicates that the node has been evicted from the cluster, CSS restarts the node.
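The per-second disk activity described above can be sketched as the following loop. This is purely illustrative: the `write_heartbeat_block`/`read_kill_block` method names, the "evicted" marker, and the injected `reboot` callback are all hypothetical, not the real on-disk format or CSS internals.

```python
import time

def css_disk_loop(voting_disk, my_node, *, should_stop, reboot,
                  sleep=time.sleep):
    """One-second CSS cycle: write my heartbeat block, read my kill block.

    `voting_disk` is assumed to expose write_heartbeat_block(node, ts)
    and read_kill_block(node); both names are made up for this sketch.
    """
    while not should_stop():
        # 1. Disk heartbeat: stamp this node's block on the voting disk.
        voting_disk.write_heartbeat_block(my_node, time.time())
        # 2. Read this node's "kill block"; if it marks the node as
        #    evicted from the cluster, CSS restarts the node.
        if voting_disk.read_kill_block(my_node) == "evicted":
            reboot()
            return
        sleep(1)
```

Passing `should_stop`, `reboot`, and `sleep` as parameters is only for testability of the sketch; the point is the once-per-second write-then-read cycle.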

To keep the disk heartbeat and kill-block reads working, CSS requires that at least (N/2 + 1) of the N voting disks be accessible to each node; this guarantees that any two nodes can both access at least one common voting disk. Under normal circumstances (note: normal), as long as a node can access more voting disks than it cannot, the node lives on happily. When the inaccessible voting disks outnumber the accessible ones, the cluster synchronization service process fails and causes the node to restart. There is a claim that two voting disks are enough to ensure redundancy and that three or more are unnecessary; this is wrong. Oracle recommends that a cluster have at least three voting disks.
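The (N/2 + 1) rule reduces to a one-line majority check (a sketch of the rule as stated above; the function name is mine):

```python
def has_quorum(total_disks, accessible_disks):
    """A node keeps running only if it can access a strict majority
    of the configured voting disks: at least (N/2 + 1) of them,
    using integer division for N/2."""
    required = total_disks // 2 + 1
    return accessible_disks >= required

# With three voting disks, losing one is survivable; losing two is not.
print(has_quorum(3, 2), has_quorum(3, 1))  # True False
```

Note how integer division makes even counts unattractive: with two disks, `required` is already 2, so a single disk failure is fatal.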

Question:

Some people ask: why must the number of voting disks be odd?

Answer:

In fact, an odd number of voting disks is only a recommendation, not a requirement. In 10gR2, the maximum number of voting disks is 32.

Question

Can we use 2 or 4 vote disks?

Answer:

Yes. However, 2 and 4 are unfavorable under the hard rule that at least (N/2 + 1) voting disks must be accessible to a node:

With two voting disks, no voting-disk heartbeat may fail.

With three voting disks, at most one voting-disk heartbeat may fail.

With four voting disks, still at most one heartbeat may fail. That is the same tolerance as with three disks, but the extra disk adds management cost and introduces risk.

With five voting disks, at most two heartbeats may fail.

With six voting disks, still at most two heartbeats may fail. Again, the extra disk beyond five only adds unreasonable management cost and risk.
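The tolerances listed above follow directly from the (N/2 + 1) rule: the maximum number of failed disks a node survives is N minus the required majority. A quick sketch to reproduce the list:

```python
def max_failures(total_disks):
    """Maximum voting-disk heartbeat failures a node survives while
    still accessing the required (N/2 + 1) majority."""
    return total_disks - (total_disks // 2 + 1)

for n in range(2, 7):
    print(n, max_failures(n))
# prints: 2 0 / 3 1 / 4 1 / 5 2 / 6 2
```

The pairs (3, 4) and (5, 6) tolerate the same number of failures, which is exactly why odd counts are recommended: the even configuration buys no extra tolerance for its extra disk.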

Question:

If the network heartbeat between nodes is normal, and the voting disks a node can access still outnumber the ones it cannot, for example when exactly one of three voting disks has its disk heartbeat time out, will split-brain occur?

Answer:

In this case, neither split-brain nor the node eviction protocol is triggered. When only a minority of voting-disk heartbeats fail (few enough that the (N/2 + 1) rule still holds), the failure may simply be a transient I/O error while the node accessed the voting disk. CSS immediately marks the failed voting disk as OFFLINE. Although some voting disks are OFFLINE, at least (N/2 + 1) voting disks remain available, which guarantees that the eviction protocol is not invoked, so no node is rebooted. The Disk Ping Monitor Thread (DPMT, clssnmDiskPMT) of the node monitor module keeps trying to access the failed OFFLINE voting disks; if a disk becomes accessible again and the data on it verifies as correct, CSS marks the voting disk ONLINE again. If the voting disk still cannot be accessed within 45 s (a value derived from misscount by an internal algorithm), DPMT writes a warning to cssd.log, such as:

[CSSD] 20:11:20.668 >
WARNING: clssnmDiskPMT: long disk latency > (45940 ms) to voting disk (0//dev/asm-votedisk1)

Assume three voting disks in the RAC scenario where the clssnmDiskPMT warning occurs, with asm-votedisk1 already marked OFFLINE because of an I/O error or some other reason. If a second voting disk also fails its disk heartbeat, the node drops below the required number (2) of accessible voting disks, triggers the eviction protocol, and then reboots.

A voting-disk heartbeat failure that does not violate the (N/2 + 1) rule generates only a warning, not a fatal error. Because a majority of the voting disks remain accessible, the warning is non-fatal and the eviction protocol is not triggered.
