[Repost] The split-brain problem in an Elasticsearch cluster


Reposted from http://blog.csdn.net/cnweike/article/details/39083089

The so-called split-brain problem (loosely analogous to schizophrenia) occurs when different nodes in the same cluster hold conflicting views of the cluster's state.

Today our Elasticsearch cluster started answering queries extremely slowly. I checked the cluster status with the following command:

  curl -XGET 'es-1:9200/_cluster/health'

The overall cluster status was red, and of the original 9 nodes only 4 appeared in the result. Worse, when I sent the same request to different nodes, the overall status was still red, but the number of available nodes each one reported was inconsistent.

Under normal circumstances, every node in the cluster should agree on which node is the elected master, so the state information returned by any node should be identical. Inconsistent state information means that different nodes disagree about who the master is - the so-called split-brain problem. In such a split state, nodes lose the correct view of the cluster, and the cluster cannot work properly.
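As an illustration, a quick way to see which master each node believes in is to ask every node through the _cat/master API. A minimal sketch, assuming the nodes are reachable as es-1 through es-9 on port 9200 (the hostnames beyond es-1 are my assumption, following the article's example):

  # Ask each node which master it currently recognizes (hostnames assumed)
  for host in es-1 es-2 es-3 es-4 es-5 es-6 es-7 es-8 es-9; do
    echo -n "$host -> "
    curl -s "http://$host:9200/_cat/master"   # prints the master this node follows
  done

If the nodes print different master names, the cluster has indeed split into multiple partial views.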

Possible causes:

1. Network: Since all communication is over the intranet, a network problem could cause some nodes to believe the master had died and to elect a replacement, but this is unlikely. Checking the Ganglia cluster monitoring also showed no abnormal intranet traffic, so this cause can be ruled out.

2. Node load: The master role and the data role were mixed on the same servers. When a data node's workload is heavy (and it indeed was), the corresponding ES instance can stop responding. If that server also happens to be acting as the master, some nodes will conclude the master has failed and elect a new one, and a split brain appears. In addition, because the ES process on a data node occupies a large amount of memory, large-scale garbage collection can also make the ES process unresponsive. This cause is therefore the most likely one.

How we addressed the problem:

1. Based on the analysis above, the inferred root cause is that node load made the master process stop responding, which led different nodes to disagree about which node was the master. An intuitive fix is therefore to separate the master role from the data role. To do this, we added three servers to the ES cluster whose only role is master: they neither store data nor serve search, so they run as relatively lightweight processes. The role can be restricted with the following configuration:


  node.master: true
  node.data: false

Naturally, the other (data) nodes must no longer be master-eligible, which is done by simply reversing the two settings above. This cleanly separates master nodes from data nodes. In addition, so that newly joined nodes can locate the master quickly, the data nodes' default multicast master discovery can be switched to unicast (a consolidated data-node example is sketched after the snippet below):


  discovery.zen.ping.multicast.enabled: false
  discovery.zen.ping.unicast.hosts: ["master1", "master2", "master3"]
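Putting the pieces together, a data node's elasticsearch.yml would then look roughly like this sketch; master1 through master3 are the same placeholder hostnames used above, and the file as a whole is only an illustration, not the article's exact configuration:

  # elasticsearch.yml on a data node (illustrative sketch)
  node.master: false                                   # may never be elected master
  node.data: true                                      # stores data and serves search
  discovery.zen.ping.multicast.enabled: false          # disable multicast discovery
  discovery.zen.ping.unicast.hosts: ["master1", "master2", "master3"]

The three dedicated master servers carry the opposite role settings (node.master: true, node.data: false) shown earlier.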

2. There are also two intuitive parameters that can reduce the likelihood of a split brain; both are shown together in a small configuration sketch after the two descriptions below.

discovery.zen.ping_timeout (default 3 seconds): by default, if the master does not answer a node's ping within 3 seconds, that node considers the master dead. Increasing this value gives the master more time to respond and, to some extent, reduces false failure detection.

discovery.zen.minimum_master_nodes (default 1): this parameter controls the minimum number of master-eligible nodes a node must see before it can take part in electing a master and operating in the cluster. The officially recommended value is (N/2) + 1, where N is the number of master-eligible nodes. In our case N is 3, so the parameter is set to 2. Note that for a cluster with only 2 master-eligible nodes, setting it to 2 is problematic: once one of them goes down, the survivor can never see 2 master-eligible nodes and the cluster can no longer elect a master.
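For reference, a hedged sketch of how these two settings might look in elasticsearch.yml for the setup described above (three dedicated master-eligible nodes); the 10-second timeout is an arbitrary illustrative value of my own, not a figure from the original article:

  discovery.zen.ping_timeout: 10s        # wait longer before declaring the master dead (illustrative value)
  discovery.zen.minimum_master_nodes: 2  # (3 / 2) + 1 = 2 with three master-eligible nodes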
