Elasticsearch Introduction Series (IV): Exploring Distributed Clusters

Source: Internet
Author: User

Preface: Elasticsearch works hard to hide the complexity of distributed systems. The following operations happen automatically under the hood:

Partitioning your documents into different containers, or shards, which can reside on one or more nodes

Distributing shards evenly across nodes to load-balance indexing and search

Duplicating each shard to provide redundancy and prevent data loss in case of hardware failure

Routing requests from any node in the cluster to the nodes that hold the relevant data

Seamlessly scaling and migrating shards as nodes are added or removed

First, how a cluster works internally

Elasticsearch is built for high availability and scalability. You can scale by buying more powerful servers (vertical scaling, or scaling up) or by buying more servers (horizontal scaling, or scaling out).

Although Elasticsearch benefits from more powerful hardware, vertical scaling has its limits. Real scalability comes from horizontal scaling: adding nodes to the cluster to spread the load and increase reliability.

With most databases, scaling out means making major changes to your application so that it can take advantage of the added machines. In contrast, Elasticsearch is distributed by nature: it knows how to manage multiple nodes to provide scale and high availability.

Second, an empty cluster

If we start a single node with no data and no indices, we get a cluster that consists of one empty node.

A node is a running Elasticsearch instance, and a cluster consists of one or more nodes that have the same cluster.name and work together to share data and load. When a node joins or leaves the cluster, the cluster notices and rebalances the data.

One node in the cluster is elected as the master node. It manages cluster-level changes such as creating or deleting indices and adding or removing nodes. The master node does not need to take part in document-level changes or searches, so it does not become a bottleneck as traffic grows. Any node can become the master node; when there is only one node, that node is the master.

As users, we can talk to any node in the cluster, including the master node. Every node knows where each document lives and can forward a request to the nodes that hold the data. The node we talk to collects the responses from those nodes and returns the final result to the client. All of this is handled by Elasticsearch.
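The request flow described above can be sketched with a toy in-memory model. This is only an illustration: the Cluster class, its methods, the node names, and the document ID are hypothetical and are not part of any Elasticsearch API.

```python
# Toy model of request forwarding; this is not Elasticsearch code.
# The Cluster class, node names, and document IDs are hypothetical.

class Cluster:
    def __init__(self, node_names):
        # Each node holds its own documents: {node_name: {doc_id: document}}
        self.storage = {name: {} for name in node_names}
        # Every node shares the same view of where each document lives.
        self.location = {}

    def put(self, owner, doc_id, document):
        """Store a document on a specific node and record its location."""
        self.storage[owner][doc_id] = document
        self.location[doc_id] = owner

    def get(self, entry_node, doc_id):
        """Handle a read that arrives at entry_node (the node we talk to)."""
        if entry_node not in self.storage:
            raise KeyError(f"unknown node: {entry_node}")
        owner = self.location[doc_id]           # which node holds the data
        document = self.storage[owner][doc_id]  # "forward" the request there
        # The entry node hands the result back to the client.
        return {"coordinator": entry_node, "served_by": owner, "doc": document}


cluster = Cluster(["node1", "node2", "node3"])
cluster.put("node2", "blog-1", {"title": "Hello"})

# The client may contact any node; the document returned is the same.
for entry in ("node1", "node2", "node3"):
    print(cluster.get(entry, "blog-1"))
```

Whichever node receives the request, the document comes from node2; only the forwarding step differs.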

Second, cluster health

Many statistics can be monitored in an Elasticsearch cluster, but the single most important one is cluster health, which reports a status of green, yellow, or red.

Request: GET /_cluster/health

On an empty cluster with no indices, it returns something like:

{
   "cluster_name":          "elasticsearch",
   "status":                "green", <1>
   "timed_out":             false,
   "number_of_nodes":       1,
   "number_of_data_nodes":  1,
   "active_primary_shards": 0,
   "active_shards":         0,
   "relocating_shards":     0,
   "initializing_shards":   0,
   "unassigned_shards":     0
}

 

<1> The status field gives an overall indication of how the cluster is functioning:

Green: All primary shards and replica shards are active

Yellow: All primary shards are active, but not all replica shards are active

Red: Not all primary shards are active
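The three rules above can be written as a small function. This is a minimal sketch of the rule only, not how Elasticsearch computes health internally; the record keys 'primary' and 'active' are names made up for the illustration.

```python
def cluster_status(shards):
    """Return 'green', 'yellow', or 'red' for a list of shard records.

    Each record is a dict with two hypothetical keys:
      'primary': True for a primary shard, False for a replica shard
      'active':  True if the shard is allocated and serving requests
    """
    primaries_ok = all(s["active"] for s in shards if s["primary"])
    replicas_ok = all(s["active"] for s in shards if not s["primary"])
    if not primaries_ok:
        return "red"
    if not replicas_ok:
        return "yellow"
    return "green"


# A cluster with no indices has no shards, so nothing is missing.
print(cluster_status([]))  # green

# Three active primaries whose three replicas are not allocated.
one_node = (
    [{"primary": True, "active": True}] * 3
    + [{"primary": False, "active": False}] * 3
)
print(cluster_status(one_node))  # yellow
```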

Third, adding an index

To add data to Elasticsearch, we need an index, a place to store related data. In reality, an index is just a logical namespace that points to one or more shards.

A shard is a low-level worker unit that holds just a slice of all the data in the index. Each shard is a single Lucene instance and is a complete search engine in its own right.

Shards are how Elasticsearch distributes data around the cluster. Think of shards as containers for data: documents are stored in shards, and shards are allocated to the nodes in your cluster. As the cluster grows or shrinks, Elasticsearch automatically migrates shards between nodes so that the cluster stays balanced.

A shard is either a primary shard or a replica shard. Each document in your index belongs to a single primary shard, so the number of primary shards determines the maximum amount of data the index can hold.

A replica shard is just a copy of a primary shard. Replicas protect against data loss due to hardware failure, and they can serve read requests such as searching or retrieving a document.

The number of primary shards is fixed once the index has been created, but the number of replica shards can be changed at any time.

Let's create an index called blogs on the only (empty) node in the cluster. By default an index is assigned 5 primary shards (the default in the Elasticsearch versions this series covers; releases from 7.0 onward default to 1). Here we assign 3 primary shards and one replica of each primary:
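One way to see why the primary shard count is fixed is to look at how a document is mapped to a shard. The sketch below uses a simplified form of the routing rule described in the Elasticsearch documentation, shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document ID. As an assumption for illustration, zlib.crc32 stands in for the hash function Elasticsearch really uses, so the shard numbers will not match a real cluster.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards):
    """Map a routing value (by default the document ID) to a primary shard."""
    # crc32 is a stand-in hash chosen for this sketch, not Elasticsearch's own.
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % number_of_primary_shards


doc_ids = [f"doc-{i}" for i in range(1000)]

with_3 = {d: pick_shard(d, 3) for d in doc_ids}
with_4 = {d: pick_shard(d, 4) for d in doc_ids}

# The same ID always lands on the same shard while the shard count is fixed.
assert pick_shard("doc-42", 3) == with_3["doc-42"]

# With a different shard count, many IDs map to a different shard number,
# so documents indexed earlier would be looked up in the wrong place.
moved = sum(1 for d in doc_ids if with_3[d] != with_4[d])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Replica shards do not take part in this calculation, which is consistent with their number being adjustable at any time.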

PUT /blogs
{
   "settings" : {
      "number_of_shards" : 3,
      "number_of_replicas" : 1
   }
}
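For readers who want to issue this request from a script, here is a minimal sketch using only the Python standard library. The address http://localhost:9200 and the helper name build_create_index_request are assumptions for illustration; the code builds the request but does not send it.

```python
import json
import urllib.request


def build_create_index_request(base_url, index, shards, replicas):
    """Build (but do not send) the PUT request that creates an index."""
    body = {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Assumed address of a local test node.
request = build_create_index_request("http://localhost:9200", "blogs", 3, 1)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To send it to a running node (not executed here):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```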

(Figure: a single-node cluster with an index.)

If we check the cluster health now, we see:

{
   "cluster_name":          "elasticsearch",
   "status":                "yellow", <1>
   "timed_out":             false,
   "number_of_nodes":       1,
   "number_of_data_nodes":  1,
   "active_primary_shards": 3,
   "active_shards":         3,
   "relocating_shards":     0,
   "initializing_shards":   0,
   "unassigned_shards":     3 <2>
}

<1> The cluster status is now yellow

<2> Our three replica shards have not been allocated to a node

A cluster health of yellow means that all primary shards are up and running (the cluster can serve any request), but not all replica shards are active. In fact, all three of our replica shards are currently unassigned: they have not been allocated to a node. It makes no sense to store a copy of the same data on the same node, because if that node failed we would lose every copy of the data.
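The rule that two copies of the same shard never share a node can be sketched with a toy allocator. This is an illustration under simplifying assumptions, not Elasticsearch's real allocation algorithm: the allocate function and the node names are hypothetical, and the toy places all shards from scratch on the least loaded eligible node.

```python
def allocate(num_primaries, num_replicas, nodes):
    """Place every shard copy on a node, never two copies of one shard together.

    Returns (assignment, unassigned): assignment maps node -> labels such as
    'P0' or 'R0'; unassigned lists the copies that found no eligible node.
    """
    assignment = {node: [] for node in nodes}
    holders = {shard: set() for shard in range(num_primaries)}
    unassigned = []

    copies = [("P", s) for s in range(num_primaries)]
    copies += [("R", s) for _ in range(num_replicas) for s in range(num_primaries)]

    for kind, shard in copies:
        # Only nodes that do not already hold a copy of this shard qualify.
        candidates = [n for n in nodes if n not in holders[shard]]
        if not candidates:
            unassigned.append(f"{kind}{shard}")
            continue
        target = min(candidates, key=lambda n: len(assignment[n]))
        assignment[target].append(f"{kind}{shard}")
        holders[shard].add(target)
    return assignment, unassigned


for nodes in (["node1"], ["node1", "node2"]):
    placed, missing = allocate(3, 1, nodes)
    print(len(nodes), "node(s):", placed, "unassigned:", missing)
```

With one node the three replicas stay unassigned, which matches the unassigned_shards value of 3 in the health response; with a second node every copy finds a home.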

IV. Adding failover

Running a single node means a single point of failure: there is no backup copy of the data. To protect against failure, we need to start another node.

As long as the second node has the same cluster.name as the first node, it automatically discovers and joins the cluster that the first node belongs to. If it does not, network broadcast (multicast) may be disabled, or a firewall may be blocking communication between the nodes.

(Figure: a two-node cluster in which all primary and replica shards are allocated.)

The second node has joined the cluster, and three replica shards have been allocated to it, one for each primary shard. This means we can lose either node and still have all of our data intact.

A newly indexed document is first stored on a primary shard and then copied to the corresponding replica shard, which ensures that the document can be retrieved from either the primary or the replica.

Cluster health status:

{
   "cluster_name":          "elasticsearch",
   "status":                "green", <1>
   "timed_out":             false,
   "number_of_nodes":       2,
   "number_of_data_nodes":  2,
   "active_primary_shards": 3,
   "active_shards":         6,
   "relocating_shards":     0,
   "initializing_shards":   0,
   "unassigned_shards":     0
}

 

<1> The cluster status is now green: our cluster is not only fully functional, but also highly available.

V. Horizontal expansion

When we start a third node, the cluster reorganizes itself.

One shard each from Node1 and Node2 has moved to Node3, so each node now holds two shards instead of three. Each shard therefore gets a larger share of its node's hardware resources. A shard is a complete search engine in its own right and can use all the resources of a single node. With our 6 shards (3 primaries and 3 replicas), we can scale out to a maximum of 6 nodes, with one shard on each node and each shard able to use 100% of that node's resources.
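The effect of adding nodes can be sketched as simple arithmetic. The helper shards_per_node is a hypothetical name, and the sketch only counts shards per node; it ignores which copies end up where.

```python
def shards_per_node(total_shards, num_nodes):
    """Spread shards as evenly as possible; return the count on each node."""
    base, extra = divmod(total_shards, num_nodes)
    # The first `extra` nodes carry one additional shard.
    return [base + 1 if i < extra else base for i in range(num_nodes)]


TOTAL = 3 + 3  # 3 primary shards plus 3 replica shards

for n in range(2, 8):
    print(n, "nodes:", shards_per_node(TOTAL, n))
```

From 7 nodes on, at least one node holds no shard at all, which is why 6 nodes is the ceiling for this index with its current settings.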

VI. Continuing to scale

If we want to scale beyond 6 nodes, we can increase the number of replica shards, which can be changed on a live cluster. Here we raise the number of replicas from 1 to 2:

PUT /blogs/_settings
{
   "number_of_replicas" : 2
}
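The arithmetic behind this setting can be sketched as follows; total_shards is a hypothetical helper name. Each primary shard has number_of_replicas copies, so the index has primaries × (1 + replicas) shards in total.

```python
def total_shards(primaries, replicas):
    """Every primary has `replicas` copies: total = primaries * (1 + replicas)."""
    return primaries * (1 + replicas)


for replicas in (1, 2):
    total = total_shards(3, replicas)
    print(f"number_of_replicas={replicas}: {total} shards, "
          f"useful on up to {total} nodes")
```

With 2 replicas the blogs index has 9 shards and can spread over 9 nodes. Raising the replica count without adding nodes does not add capacity by itself, because each shard then gets a smaller share of its node's resources.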

VII. Coping with failures

Suppose we kill the process of one node, Node1.

The node we killed was the master node. A cluster must have a master node in order to function properly, so the first thing that happens is that the remaining nodes elect a new master.

Primary shards 1 and 2 were lost when we killed Node1, and our index cannot function properly while primary shards are missing. If we checked the cluster health at this point, we would see status red: not all primary shards are available.

The new master node promotes the replicas of these shards on Node2 and Node3 to primary shards, which brings the cluster back to yellow.

Why yellow instead of green?

We have all three primary shards, but we specified two replica shards for each primary, and currently only one replica of each is allocated, so the cluster cannot reach green.
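The red, then yellow sequence can be replayed with a toy model. This is a sketch under stated assumptions: the text only says that Node1 holds primary shards 1 and 2, so the rest of the layout below is assumed, and kill_node and status are hypothetical helpers. Which surviving replica gets promoted is arbitrary in this toy.

```python
def kill_node(layout, dead):
    """Remove a node, then promote one surviving replica per lost primary.

    layout maps node name -> labels such as 'P1' (primary of shard 1)
    or 'R1' (a replica of shard 1). Returns a new layout.
    """
    survivors = {n: list(s) for n, s in layout.items() if n != dead}
    lost_primaries = [s[1:] for s in layout[dead] if s.startswith("P")]
    for shard in lost_primaries:
        for shards in survivors.values():
            if "R" + shard in shards:
                shards[shards.index("R" + shard)] = "P" + shard  # promote
                break
    return survivors


def status(layout, num_primaries, num_replicas):
    """Apply the green/yellow/red rule to a layout."""
    labels = [s for shards in layout.values() for s in shards]
    for shard in range(num_primaries):
        if f"P{shard}" not in labels:
            return "red"
    for shard in range(num_primaries):
        if labels.count(f"R{shard}") < num_replicas:
            return "yellow"
    return "green"


# Assumed layout with 3 primaries and 2 replicas of each on 3 nodes.
layout = {
    "Node1": ["P1", "P2", "R0"],
    "Node2": ["R0", "R1", "R2"],
    "Node3": ["P0", "R1", "R2"],
}
print(status(layout, 3, 2))                 # green

# Node1 is gone but nothing has been promoted yet.
without_node1 = {n: s for n, s in layout.items() if n != "Node1"}
print(status(without_node1, 3, 2))          # red

# After promotion every primary exists, but each has only one replica.
after = kill_node(layout, "Node1")
print(after, status(after, 3, 2))           # ends with: yellow
```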
