Elasticsearch: handling the problem of some nodes not finding the cluster (split-brain)

Description of the phenomenon
Environment: ES 1.4.5 on CentOS 6.5

Three ES nodes, es1, es2, and es3, form a cluster, and the cluster state is normal. When the es1 server is rebooted, es1 cannot rejoin the cluster and elects itself as master, producing the so-called "split-brain" in the ES cluster; after restarting the ES service on es1, it discovers the cluster and joins normally.
When the es2 server is rebooted, es2 likewise cannot rejoin the cluster and elects itself as master, again producing a split-brain; even after restarting the ES service, es2 still cannot find the cluster.
When the es3 server is rebooted, es3 rejoins the cluster normally.

Analysis

The ES service and plug-in versions are identical on all three servers, and the configurations differ only in the node name. The startup log of the ES service shows:
[2015-07-22 16:48:24,628][INFO ][cluster.service] [es_node_10_0_31_2] new_master [es_node_10_0_31_2][FDJA3KUTTHC7EJUS4H78FA][localhost][inet[/10.0.31.2:9300]]{rack=rack2, master=true}, reason: zen-disco-join (elected_as_master)
During startup the node could not discover the cluster, so it elected itself as master.
The likely cause is network-related: discovery.zen (the Zen discovery module in ES) timed out before it could find the existing cluster, so the node elected itself master.
Changing discovery.zen.ping_timeout from the original 10s to 30s and restarting es1 made it behave normally. Applying the same change to es2 had no effect.
es2 was then modified with the following settings:
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping_timeout: 120s
discovery.zen.minimum_master_nodes: 2   # minimum number of master-eligible nodes that must be discovered
client.transport.ping_timeout: 60s
discovery.zen.ping.unicast.hosts: ["10.0.31.2", "10.0.33.2"]   # the other nodes in the cluster that may become master, in case they cannot be discovered otherwise
With these settings, es2 finds the cluster normally after the server is rebooted, and the service works normally.
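For context (the original article does not spell this out, but it is the standard Zen discovery guidance for the ES 1.x line), discovery.zen.minimum_master_nodes is meant to be a quorum of the master-eligible nodes:

# Quorum rule for Zen discovery (ES 1.x guidance):
#   minimum_master_nodes = floor(master_eligible_nodes / 2) + 1
# For this three-node cluster: floor(3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2

With a quorum of 2, a freshly rebooted node that can only see itself never reaches quorum, so it keeps pinging the unicast hosts instead of electing itself master and splitting the cluster.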


After this experiment, the following settings were added to the configuration of all three ES nodes:
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping_timeout: 120s
discovery.zen.minimum_master_nodes: 2
client.transport.ping_timeout: 60s
discovery.zen.ping.unicast.hosts: ["10.0.31.2", "10.0.33.2"]


Only the IPs and the timeout values differ slightly between nodes; es2's timeout is set to the longest. A per-node sketch of how these settings might look in elasticsearch.yml is shown below.
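As a reference, here is a minimal sketch of how the shared settings might sit in a single node's elasticsearch.yml; the cluster.name value is an illustrative assumption (it does not appear in the original article), and discovery.zen.ping.unicast.hosts should list the other nodes from each node's own point of view:

# Hypothetical elasticsearch.yml for the node at 10.0.31.2 (es1); cluster.name is an assumption.
cluster.name: es_cluster
node.name: es_node_10_0_31_2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping_timeout: 120s
discovery.zen.minimum_master_nodes: 2
client.transport.ping_timeout: 60s
# The other two nodes, so a restarted node can reach an existing master by unicast ping.
discovery.zen.ping.unicast.hosts: ["10.0.32.2", "10.0.33.2"]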
Although the es2 service is normal, an exception appears in its startup log:
[2015-07-22 21:43:29,012][WARN ][transport.netty] [es_node_10_0_32_2] exception caught on transport layer [[id: 0x5c87285c]], closing connection
java.net.NoRouteToHostException: No route to host
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
	at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
	at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
	at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
	at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
	at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
	at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
[2015-07-22 21:43:55,839][WARN ][discovery
This is suspected to be network-related, although it does not affect the service. java.net.NoRouteToHostException generally means the connection attempt to another node's transport port (9300) was rejected at the network layer, for example by a firewall or routing rule, rather than by Elasticsearch itself.

Summary:

After the ES service starts, discovering the cluster takes a fairly long time, and the cluster is not found if the timeout is set too short. The root cause is unknown; the settings above simply give each node the best possible chance of finding the cluster.

If you are reading this and know the root cause of the problem, or have a better solution, please let us know; it would be much appreciated.

Original address: http://blog.csdn.net/huwei2003/article/details/47004745
