Problem description
ES version: 1.4.5, OS: CentOS 6.5
es1, es2 and es3 form a three-node ES cluster, and the cluster state is normal. When the es1 server is rebooted, es1 cannot rejoin the cluster and elects itself as master, producing the so-called "split brain" in the ES cluster; after restarting the ES service on es1, es1 discovers the cluster and joins it normally.
When the es2 server is rebooted, es2 likewise cannot rejoin the cluster and elects itself as master, again producing a "split brain"; even after restarting the ES service, es2 still cannot find the cluster.
When the es3 server is rebooted, es3 rejoins the cluster normally.
Analysis
The ES service and plugin versions on the three servers are identical, and their configurations differ only in the node name. The startup log of the ES service shows:
[2015-07-22 16:48:24,628][INFO ][cluster.service] [es_node_10_0_31_2] new_master [es_node_10_0_31_2][FDJA3KUTTHC7EJUS4H78FA][localhost][inet[/10.0.31.2:9300]]{rack=rack2, master=true}, reason: zen-disco-join (elected_as_master)
During startup the node failed to discover the cluster and therefore elected itself master.
This points to a network-related cause: discovery.zen (the cluster discovery module in ES) timed out before it could find the existing cluster, so the node elected itself master.
Changing discovery.zen.ping_timeout from the original 10s to 30s and restarting es1 made es1 join the cluster normally. Applying the same change to es2 had no effect.
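For reference, the change on es1 could be applied roughly as follows; the config path /etc/elasticsearch/elasticsearch.yml and the init-script service name are assumptions based on a default RPM install and may differ on your machines:

# raise the zen discovery ping timeout from 10s to 30s (config path assumed)
sed -i 's/^discovery.zen.ping_timeout:.*/discovery.zen.ping_timeout: 30s/' /etc/elasticsearch/elasticsearch.yml
# restart the ES service so the new timeout takes effect
service elasticsearch restart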
The settings on es2 were changed as follows:
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping_timeout: 120s
discovery.zen.minimum_master_nodes: 2   # minimum number of master-eligible nodes that must be discovered
client.transport.ping_timeout: 60s
discovery.zen.ping.unicast.hosts: ["10.0.31.2", "10.0.33.2"]   # IPs of the other master-eligible nodes in the cluster, pinged directly when the cluster cannot otherwise be found
With these settings, es2 discovers the cluster normally after a server restart and the service works correctly.
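One way to confirm that es2 has actually joined the existing cluster rather than running as its own master is to query the standard cluster APIs (the IP is es2's address in this setup; 9200 is the default HTTP port):

# the cluster should report all three nodes and not be red
curl -s 'http://10.0.32.2:9200/_cluster/health?pretty'
# every node should report the same elected master
curl -s 'http://10.0.32.2:9200/_cat/master?v'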
After this experiment, the following settings were added to the configuration of all three ES nodes:
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping_timeout: 120s
discovery.zen.minimum_master_nodes: 2
client.transport.ping_timeout: 60s
discovery.zen.ping.unicast.hosts: ["10.0.31.2", "10.0.33.2"]
Only the unicast host IPs differ per node, and the timeout values vary slightly; es2 is given the longest timeout.
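Put together, the discovery-related part of es2's configuration looks roughly like the sketch below; the config path is an assumption, node.name is taken from the log output further down, and any keys already present in the file should be merged rather than duplicated:

# append the discovery settings for es2 (merge manually if some keys already exist; config path assumed)
cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
node.name: es_node_10_0_32_2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping_timeout: 120s
discovery.zen.minimum_master_nodes: 2
client.transport.ping_timeout: 60s
discovery.zen.ping.unicast.hosts: ["10.0.31.2", "10.0.33.2"]
EOF

With three master-eligible nodes, minimum_master_nodes: 2 follows the usual floor(n/2)+1 rule, and it is this setting that actually stops a lone restarted node from electing itself master.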
Although the es2 service is now normal, an exception still appears in its startup log:
[2015-07-22 21:43:29,012][WARN ][transport.netty] [es_node_10_0_32_2] exception caught on transport layer [[id: 0x5c87285c]], closing connection
java.net.NoRouteToHostException: No route to host
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[2015-07-22 21:43:55,839][WARN ][discovery
This is suspected to be network related, though it does not affect the service.
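Since NoRouteToHostException points at basic connectivity rather than ES itself, a few quick checks on CentOS 6.5 may help narrow it down; the IPs and ports are the ones used in this cluster, and whether iptables is the culprit is only a guess:

# can es2 reach the transport port of the other nodes at all?
telnet 10.0.31.2 9300
telnet 10.0.33.2 9300
# "No route to host" is usually an ICMP unreachable, often from the firewall on the target node
service iptables status
# if iptables is blocking, temporarily open the ES ports on that node (persist the rule properly afterwards)
iptables -I INPUT -p tcp -m multiport --dports 9200,9300 -j ACCEPT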
Summary:
After the ES service starts, it takes a surprisingly long time to discover the cluster, and with a short timeout the cluster is never found. The root cause is unknown; the changes above simply adjust the settings so that the node has the best possible chance of finding the cluster.
If anyone reading this knows the root cause of the problem or has a better solution, please share it. Thanks.
Original address: http://blog.csdn.net/huwei2003/article/details/47004745