Elasticsearch inherently supports distributed deployment, and the availability of the system can be improved by deploying it as a cluster. This article focuses on the node-related concepts of an Elasticsearch cluster, which are prerequisites for cluster deployment and topology design. How to configure a cluster in the configuration file is not covered here.
Node Types
1. Candidate master node (master-eligible node)
Once a node starts, it uses the Zen discovery mechanism to find the other nodes in the cluster and establish connections with them. The cluster elects a master node from among the master-eligible nodes; the master is responsible for creating and deleting indexes, allocating shards, and tracking node state in the cluster. The workload of the master node in Elasticsearch is relatively light: a user request can be sent to any node, which distributes the work and returns the results itself, without forwarding through the master node.
Under normal circumstances, all nodes in the cluster agree on which node is the master; that is, there is only one elected master node per cluster. However, in some situations, such as network communication problems or a master that stops responding under heavy load, a re-election is triggered, and the cluster may temporarily contain multiple master nodes. The nodes then hold inconsistent views of the cluster state, a situation known as split-brain. To avoid it, the minimum number of master-eligible nodes required to form a cluster can be set via discovery.zen.minimum_master_nodes. The recommended value is (number of master-eligible nodes / 2) + 1. For example, with three master-eligible nodes the setting should be (3 / 2) + 1 = 2, which guarantees that more than half of the master-eligible nodes are present in the cluster.
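The quorum rule above can be sketched as a one-line calculation (the function name is my own, not an Elasticsearch API):

```python
def minimum_master_nodes(master_eligible_count: int) -> int:
    # Recommended value for discovery.zen.minimum_master_nodes:
    # more than half of the master-eligible nodes (quorum).
    return master_eligible_count // 2 + 1

print(minimum_master_nodes(3))  # 2, matching the three-node example above
```

With five master-eligible nodes the same rule yields 3, so the cluster survives the loss of up to two of them.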
A node is made master-eligible by setting node.master to true. By default both node.master and node.data are true, so a node can act as both a master-eligible node and a data node. Because data nodes carry the data operations, their load is usually high; as the cluster grows, it is therefore recommended to separate the two roles and set up dedicated master-eligible nodes. Setting node.data to false turns a node into a dedicated master-eligible node:
node.master: true
node.data: false
2. Data node
Data nodes are responsible for data storage and related operations such as CRUD, search, and aggregation. Data nodes therefore have relatively high machine requirements: first, enough disk space to store the data; second, data operations place heavy demands on CPU, memory, and I/O. As the cluster grows, more data nodes usually need to be added to improve availability.
As mentioned above, a node can be both a master-eligible node and a data node, but because the data-node load is heavy, you should consider separating the two roles and setting up dedicated data nodes, so that data-node load does not cause the master node to stop responding.
node.master: false
node.data: true
3. Client node
According to the official documentation, a client node is neither a master-eligible node nor a data node; it is only responsible for distributing requests and merging results, that is, it plays the role of a coordinating node. Any node can in fact do this work; adding dedicated nodes of this kind is mostly for load balancing.
node.master: false
node.data: false
4. Coordinating node
The coordinating node is a role, not a real Elasticsearch node type, and there is no configuration item to designate which node coordinates: any node in the cluster can act as a coordinating node. When node A receives a query request from a user, it distributes the query to the other nodes, merges the results each node returns, and sends the complete result set back to the user. In this process, node A plays the coordinating role. Unsurprisingly, coordinating work places higher demands on CPU and memory.
Shards and Cluster Status
A shard is the smallest storage unit in Elasticsearch. The data in one index is typically stored across multiple shards, which may sit on the same machine or be scattered across several machines. The advantage is horizontal scalability: it works around the limits on disk space and performance of a single machine. Imagine storing dozens of terabytes of data on one machine; both storage space and access performance would be a problem.
By default, Elasticsearch allocates 5 shards per index, but this does not mean you must use 5 shards, nor that more shards always mean better performance; it all depends on an assessment of, and tradeoff around, your data volume. Although cross-shard queries run in parallel, request distribution and result merging cost performance and time, so with small amounts of data, spreading the data across multiple shards can be inefficient. If a concrete number must be given: I currently keep each shard at around 20 GB.
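Taking the roughly-20 GB-per-shard guideline above, a rough starting point for the primary shard count can be sketched like this (the function and its default are illustrative assumptions, not an Elasticsearch API):

```python
import math

def suggested_primary_shards(total_data_gb: float, target_shard_gb: float = 20.0) -> int:
    # Heuristic: aim for about `target_shard_gb` of data per shard,
    # never dropping below one shard.
    return max(1, math.ceil(total_data_gb / target_shard_gb))

print(suggested_primary_shards(100))  # 5
print(suggested_primary_shards(5))    # 1
```

Note this is only a sizing starting point; the primary shard count still has to be fixed when the index is created, as the next paragraph explains.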
On multiple shards versus multiple indexes: an index can span several shards, but the number of primary shards is fixed when the index is created and cannot be changed afterwards, so try not to put all data into a single index, or you will be unable to adjust it once the data grows. My suggestion is to split indexes for the same type of data by time, for example one index per week, depending on how fast the data grows.
The above refers to primary shards. To improve service reliability and disaster resilience, replica shards are typically allocated as well, to add data redundancy. For example, setting the number of replica shards to 1 creates one backup of each primary shard.
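The total number of shards the cluster must host follows directly from the primary and replica counts (a small illustrative calculation, not an Elasticsearch API):

```python
def total_shards(primaries: int, replicas_per_primary: int) -> int:
    # Each primary shard gets `replicas_per_primary` replica copies
    # in addition to itself.
    return primaries * (1 + replicas_per_primary)

# A default index: 5 primaries with 1 replica each -> 10 shards in total.
print(total_shards(5, 1))  # 10
```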
The status of the cluster can be viewed through the API (http://localhost:9200/_cluster/stats?pretty); there are three possible statuses:
- Red, which indicates that some primary shards are unassigned, and part of the data is unavailable.
- Yellow, which indicates that all primary shards are assigned and the data is available, but some replica shards are unassigned.
- Green, which indicates that both the primary and replica shards are all assigned; everything is fine.
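A client can read the status field from the API response and map it to the meanings above. The sample JSON below is an illustrative, hand-written response, not output captured from a real cluster:

```python
import json

# Illustrative sample shaped like a cluster status response (assumed values).
sample = json.loads('{"cluster_name": "my-cluster", "status": "yellow"}')

STATUS_MEANING = {
    "red": "some primary shards unassigned; part of the data is unavailable",
    "yellow": "primaries assigned and data available, but some replicas unassigned",
    "green": "all primary and replica shards assigned",
}

print(sample["status"], "->", STATUS_MEANING[sample["status"]])
```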
Deployment Topology
Finally, look at two cluster deployment topologies; tuning of individual nodes is not considered here. Topology Diagram I is a simple cluster deployment for scenarios with a small data volume. There are three nodes in the cluster, all three master-eligible, so the minimum number of working master-eligible nodes can be set to 2. Nodes 1 and 2 also act as data nodes, and the data on the two nodes is backed up by each other. When expanding a deployment with this structure, dedicated data nodes are usually added gradually as needed, and eventually the data nodes and master-eligible nodes are separated, evolving into the structure of Topology Diagram II. In Topology Diagram II there are three dedicated master-eligible nodes used for cluster state maintenance, while data nodes are added as needed; note that only dedicated data nodes are added.
Topology Diagram I
Topology Diagram II
(full text, this article address: http://blog.csdn.net/zwgdft/article/details/54585644)
Bruce,2017/01/18
Talking about the cluster deployment of Elasticsearch