Coherence: Departure and Joining of Cluster Members

Source: Internet
Author: User

Recently a Coherence cluster in a customer environment has been unstable, so I gathered some documentation to work out a few of Coherence's internal mechanisms.

1. Departure of cluster members

On state detection, the official documentation says:

Death detection is a cluster mechanism that quickly detects when a cluster member has failed. Failed cluster members are removed from the cluster and all other cluster members are notified about the departed member.

Death detection allows the cluster to differentiate between actual member failure and an unresponsive member, such as the case where a JVM conducts a full garbage collection.

Death detection identifies both process failures (TcpRing component) and hardware failure (IpMonitor component).

Process failure is detected using a ring of TCP connections opened on the same port that is used for cluster UDP communication. Each cluster member issues a unicast heartbeat, and the most senior cluster member issues the cluster heartbeat, which is a broadcast message.

Hardware failure is detected using the Java InetAddress.isReachable method, which either issues a trace ICMP ping, or a pseudo ping and uses TCP port 7. Death detection is enabled by default and is configured within the <tcp-ring-listener> element.
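The hardware check quoted above can be sketched in plain Java. This is only an illustration of repeated InetAddress.isReachable probes, not Coherence's IpMonitor code: the class name ReachabilityProbe, the failure threshold, and the probe timeout are all assumptions made for the example.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: counts consecutive failed reachability probes. */
final class ReachabilityProbe {
    private final BooleanSupplier probe;
    private final int failuresBeforeDead; // assumed threshold, not a Coherence setting
    private int consecutiveFailures = 0;

    ReachabilityProbe(BooleanSupplier probe, int failuresBeforeDead) {
        this.probe = probe;
        this.failuresBeforeDead = failuresBeforeDead;
    }

    /** Builds a probe backed by InetAddress.isReachable (ICMP echo, or a TCP port 7 attempt). */
    static ReachabilityProbe forAddress(InetAddress address, int timeoutMillis, int failuresBeforeDead) {
        return new ReachabilityProbe(() -> {
            try {
                return address.isReachable(timeoutMillis);
            } catch (IOException e) {
                return false; // treat a network error as a failed probe
            }
        }, failuresBeforeDead);
    }

    /** Runs one probe; returns true once the consecutive-failure threshold is reached. */
    boolean probeOnceAndCheckDead() {
        if (probe.getAsBoolean()) {
            consecutiveFailures = 0; // any success resets the count
        } else {
            consecutiveFailures++;
        }
        return consecutiveFailures >= failuresBeforeDead;
    }

    public static void main(String[] args) {
        ReachabilityProbe p = forAddress(InetAddress.getLoopbackAddress(), 1000, 3);
        System.out.println("loopback considered dead: " + p.probeOnceAndCheckDead());
    }
}
```

Because the reply to a ping comes from the remote machine's network stack, a check like this says something about the host rather than about any particular JVM on it.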

From other documents I found, there appear to be three situations:

1. Process failure (including Ctrl-C, kill -9, and so on) is detected through TcpRing. The ring is a set of connections established over TCP, on the same port as the cluster's UDP port. As shown in the figure, the first ring covers every node in the cluster, joined into a single ring; the other rings link the nodes that share a role.

    • Each node has at most two incoming and two outgoing connections.
    • Detection is performed within each role.
    • A broken ring means there is a problem. The TCP connection is a positive indication, so no voting mechanism is required.
    • Death detection takes about one network round trip, at the sub-millisecond level.

2. Hardware failure is detected with a ping (an ICMP ping, according to the documentation). Continuous ping failures (15s) indicate that the machine is dead. Any machine can be probed this way; the reply comes from the other side's kernel, so no confirmation from a Coherence node is needed.

3. Packet timeout.

This covers all failures (hardware, network, and process). The principle is that every packet transmission must be acknowledged; otherwise the packet is retransmitted, by default every 200 ms, and delivery times out after 5 minutes.

Both the sending side and the receiving side are then marked as candidates to be declared dead.

Several nodes then vote to determine which of them is actually dead.
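The figures in item 3 imply a simple budget: with a 200 ms resend interval and a 5 minute timeout, a packet can be retransmitted many times before the peer is suspected. The sketch below only does that arithmetic; the class and method names are hypothetical, and it assumes resends happen at a fixed interval, which the text does not confirm.

```java
import java.time.Duration;

/** Hypothetical helper for reasoning about the resend interval and delivery timeout. */
final class ResendBudget {
    /** Number of fixed-interval retransmissions that fit before the delivery timeout expires. */
    static long maxResends(Duration resendInterval, Duration deliveryTimeout) {
        long intervalMillis = resendInterval.toMillis();
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("resend interval must be at least 1 ms");
        }
        return deliveryTimeout.toMillis() / intervalMillis;
    }

    /** True when an unacknowledged packet has been outstanding for the full timeout. */
    static boolean isTimedOut(long firstSentMillis, long nowMillis, Duration deliveryTimeout) {
        return nowMillis - firstSentMillis >= deliveryTimeout.toMillis();
    }

    public static void main(String[] args) {
        Duration resend = Duration.ofMillis(200);  // default resend interval from the text
        Duration timeout = Duration.ofMinutes(5);  // default timeout from the text
        System.out.println(maxResends(resend, timeout)); // prints 1500
    }
}
```

This shows why packet timeout is the slowest of the three detection paths: a member is only suspected after the entire timeout window has elapsed without an acknowledgment.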

Here is an example found on the forum:

* Member has a timeout sending a packet to Member 16
* Member asks Member and Member to confirm departure of Member 16
* Member rejects the confirmation request (@ 2010-09-22 05:21:43.411)
* Member accepts the confirmation request (I assume it does as it has no rejection on its log)
* Member informs the rest of the cluster that Member has departed
* Member 1 (the senior Member) heartbeats Member, causing it to re-initialise itself; it then rejoins as Member 127.

This is the case I am least sure about; the details of the mechanism are not very clear.

2. Joining of cluster members

Cluster members join the cluster through the Cluster service. As I understand it, the Cluster service discovers and joins the cluster through multicast or unicast, and the cluster name must of course be the same on every member.

-Dtangosol.coherence.cluster=myfirstcluster
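The same property can also be set from code before Coherence starts, which is sometimes convenient in tests. The snippet below is a minimal sketch using only standard Java; the class name ClusterNameSetup is hypothetical, and it assumes the property is read when the cluster member starts, so it has to be set before that point.

```java
/** Hypothetical helper: applies a default cluster name unless a -D flag already set one. */
final class ClusterNameSetup {
    /** Property name taken from the command-line flag in the text. */
    static final String CLUSTER_PROPERTY = "tangosol.coherence.cluster";

    /** Returns the effective cluster name, setting the default only when none is present. */
    static String ensureClusterName(String defaultName) {
        String existing = System.getProperty(CLUSTER_PROPERTY);
        if (existing == null || existing.isEmpty()) {
            System.setProperty(CLUSTER_PROPERTY, defaultName);
            return defaultName;
        }
        return existing; // a value passed with -D wins over the default
    }

    public static void main(String[] args) {
        System.out.println(ensureClusterName("myfirstcluster"));
        // A real member would start Coherence only after this point.
    }
}
```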

Cluster Service: This service is automatically started when a cluster node must join the cluster; each cluster node always has exactly one service of this type running. This service is responsible for the detection of other cluster nodes, for detecting the failure of a cluster node, and for registering the availability of other services in the cluster.

3. Cluster communication

Coherence cluster communication uses TCMP (Tangosol Cluster Management Protocol), which can work over both multicast and unicast.

The documentation on TCMP is fairly clear, so here is an excerpt.

TCMP is an IP-based protocol that is used to discover cluster members, manage the cluster, provision services, and transmit data. TCMP can be configured to use:

    • A combination of UDP/IP multicast and UDP/IP unicast. This is the default configuration.

    • UDP/IP unicast only (that is, no multicast). See "Disabling Multicast Communication". This configuration is used for network environments that do not support multicast or where multicast is not optimally configured.

    • TCP/IP only (no UDP/IP multicast or UDP/IP unicast). See "Using the TCP Socket Provider". This configuration is used for network environments that favor TCP.

    • SDP/IP only (no UDP/IP multicast or UDP/IP unicast). See "Using the SDP Socket Provider". This configuration is used for network environments that favor SDP.

    • SSL over TCP/IP or SDP/IP. See "Using the SSL Socket Provider". This configuration is used for network environments that require highly secure communication between cluster members.

Use of multicast

Multicast is used as follows:

    • Cluster Discovery: Multicast is used to discover if there is a cluster running that a new member can join.

    • Cluster Heartbeat: The most senior member in the cluster issues a periodic heartbeat through multicast; the rate can be configured and defaults to one per second.

    • Message Delivery: Messages that must be delivered to multiple cluster members are often sent through multicast, instead of unicasting the message one time to each member.

Use of Unicast

Unicast is used as follows:

    • Direct member-to-member (point-to-point) communication, including messages, asynchronous acknowledgments (ACKs), asynchronous negative acknowledgments (NACKs) and peer-to-peer heartbeats. A majority of the communication on the cluster is point-to-point.

    • Under some circumstances, a message may be sent through unicast even if the message is directed to multiple members. This is done to shape traffic flow and to reduce the CPU load in very large clusters.

    • All communication is sent using unicast if multicast communication is disabled.

Use of TCP

TCP is used as follows:

    • A TCP/IP ring is used as an additional death detection mechanism to differentiate between actual node failure and an unresponsive node (for example, if a JVM conducts a full GC).

    • TCMP can be configured to exclusively use TCP for data transfers. Like UDP, the transfers can be configured to use only unicast or both unicast and multicast.
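The TCP ring can be pictured as each member holding a connection to the next member in some fixed order, with the last wrapping around to the first. The sketch below models only that neighbour relationship. Ordering by numeric member id is an assumption made for illustration, and RingSketch is a hypothetical name; the text does not say how Coherence actually orders the ring.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Hypothetical model of ring neighbours; not Coherence's TcpRing implementation. */
final class RingSketch {
    /** Returns the member that memberId would connect to: the next id in sorted order, wrapping around. */
    static int successor(TreeSet<Integer> members, int memberId) {
        if (!members.contains(memberId)) {
            throw new IllegalArgumentException("unknown member " + memberId);
        }
        Integer next = members.higher(memberId);
        return next != null ? next : members.first();
    }

    public static void main(String[] args) {
        TreeSet<Integer> members = new TreeSet<>(Arrays.asList(1, 5, 9));
        System.out.println(successor(members, 9)); // prints 1 (wraps around)
        members.remove(5);                          // member 5 fails and is removed
        System.out.println(successor(members, 1)); // prints 9 (ring closes the gap)
    }
}
```

In this picture a dropped connection is noticed directly by the neighbour holding it, which fits the claim that a broken ring needs no vote.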

Protocol Reliability

The TCMP protocol provides fully reliable, in-order delivery of all messages. Since the underlying UDP/IP protocol does not provide for either reliable or in-order delivery, TCMP uses a queued, fully asynchronous ACK- and NACK-based mechanism for reliable delivery of messages, with unique integral identity for guaranteed ordering of messages.
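To make the idea of ordering by integral identity concrete, here is a minimal receiver-side sketch. It buffers out-of-order packets by sequence number, releases them in order, and reports gaps that a receiver could NACK. This is a generic illustration of the technique, not TCMP's actual implementation; the class InOrderReceiver, the starting sequence of 1, and the String payload are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical receiver that restores ordering over an unordered, lossy transport. */
final class InOrderReceiver {
    private final TreeMap<Long, String> pending = new TreeMap<>();
    private long nextExpected = 1; // assumed first sequence number

    /** Accepts one packet and returns the payloads that are now deliverable in order. */
    List<String> receive(long sequence, String payload) {
        List<String> delivered = new ArrayList<>();
        if (sequence < nextExpected) {
            return delivered; // duplicate of something already delivered
        }
        pending.putIfAbsent(sequence, payload);
        while (pending.containsKey(nextExpected)) {
            delivered.add(pending.remove(nextExpected));
            nextExpected++;
        }
        return delivered;
    }

    /** Sequence numbers missing below the highest buffered packet; candidates for a NACK. */
    List<Long> missing() {
        List<Long> gaps = new ArrayList<>();
        if (pending.isEmpty()) {
            return gaps;
        }
        for (long s = nextExpected; s < pending.lastKey(); s++) {
            if (!pending.containsKey(s)) {
                gaps.add(s);
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        InOrderReceiver r = new InOrderReceiver();
        System.out.println(r.receive(2, "b")); // prints []
        System.out.println(r.missing());       // prints [1]
        System.out.println(r.receive(1, "a")); // prints [a, b]
    }
}
```

A sender would pair this with the resend timer: packets that are neither ACKed nor NACKed are simply retransmitted until the delivery timeout.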

Protocol Resource Utilization

The TCMP protocol (as configured by default) requires only three UDP/IP sockets (one multicast, two unicast) and six threads per JVM, regardless of the cluster size. This is a key element in the scalability of Coherence; regardless of the number of servers, each node in the cluster still communicates either point-to-point or with collections of cluster members without requiring additional network connections.

The optional TCP/IP ring uses a few additional TCP/IP sockets, and an additional thread.

Protocol Tunability

The TCMP protocol is very tunable to take advantage of specific network topologies, or to add tolerance for low-bandwidth and high-latency segments in a geographically distributed cluster. Coherence comes with a pre-set configuration. Some TCMP attributes are dynamically self-configuring at run time, but can also be overridden and locked down for deployment purposes.

Open questions:

1. When a cluster member (a storage node or a proxy) is very busy, will the other members remove it from the cluster?

2. Can it rejoin the cluster once its state is restored?

These need to be verified by experiment.
