ElasticSearch Foundation (3)-Principle

Source: Internet
Author: User

Reference Documentation:

Http://learnes.net/distributed_crud/bulk_requests.html

I. Distributed Cluster

1.1 Empty Cluster

A single machine with no data and no index.

One node in the cluster is elected as the master node, which is responsible for cluster-wide management.

Unlike MySQL's cluster architecture, the master in ES is only responsible for cluster-level changes, such as creating or deleting indices and adding or removing nodes. Document-level operations can be handled by any node, so the master does not become a performance bottleneck.

As users, we can talk to any node in the cluster, including the master node. Every node knows where each document lives and can forward our request directly to the nodes that hold the data we need. Whichever node we talk to, it coordinates gathering the responses from the nodes that hold the data and returns the final result to the client. All of this is managed transparently by Elasticsearch.

1.2 Failover

Earlier we introduced the concept of shards, which are divided into primary shards and replica shards.

Now we create an index

PUT /blogs
{
   "settings" : {
      "number_of_shards" : 3,
      "number_of_replicas" : 1
   }
}

Our cluster is still a single-node cluster, but now it holds the index: all three primary shards have been allocated to Node 1.

Now we start a second node with the same cluster.name as the first.

When the second node joins, three replica shards are allocated, one for each primary shard. This means that even if one node fails, the data remains intact.

Every newly indexed document is stored on a primary shard first and then copied in parallel to its replica shards. This ensures the document can be retrieved from both the primary and the replicas.

1.3 Horizontal Scaling

How do we scale as the application's needs grow? If we start a third node, the cluster automatically reorganizes itself into a three-node cluster.

Shards have been reassigned to balance the load:

One shard is moved off Node 1 and another off Node 2 onto Node 3, so that each node now holds only two shards. This means each node's hardware resources (CPU, RAM, I/O) are shared by fewer shards, and each shard performs better.

Each shard is itself a fully functional search engine capable of using all the resources of a single node. With 6 shards in total (3 primaries and 3 replicas), we can scale out to 6 nodes, one shard per node, so that each shard gets 100% of its node's resources.
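Read capacity can be grown beyond that by adding more replicas: unlike the number of primary shards, number_of_replicas can be changed on a live index. The failure-recovery walkthrough below assumes each primary shard has two replicas, which could be set with a request like this (a minimal sketch):

PUT /blogs/_settings
{
   "number_of_replicas" : 2
}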

1.4 Failure Recovery

We have mentioned earlier that Elasticsearch can handle node failures. Let's try it out. If we kill the first node, our cluster will look as follows:

(1) The node we killed was the master node. A cluster must have a master node in order to function properly, so the first step is to elect a new master from the remaining nodes: Node 2.

(2) Primary shards 1 and 2 were lost when we killed Node 1, and an index cannot work properly while any of its primary shards are missing. If we checked the cluster health at this moment, it would show red: not all primary shards are available.

Fortunately, complete copies of the two lost primary shards exist on the other nodes, so the first thing the new master does is promote the corresponding replica shards on Node 2 and Node 3 to primaries. After that, the cluster health goes back to yellow. The promotion is instantaneous, like flipping a switch.

So why is the cluster health still yellow rather than green? We have all 3 primary shards, but each primary is supposed to have 2 replica shards, and right now each has only 1. That is why the status cannot return to green, but there is no need to worry too much: if we also killed Node 2, the application could still run without losing data, because Node 3 holds a copy of every shard.

(3) If we restart Node 1, the cluster can re-allocate the missing replica shards, and the result will match the earlier three-node cluster. If the old data on Node 1 is still intact, the system will try to reuse it and only copy over the data that changed during the outage.
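Throughout this walkthrough, the cluster status can be checked with the cluster health API; a minimal sketch:

GET /_cluster/health

The response contains a status field: green means every primary and replica shard is allocated, yellow means all primaries are allocated but some replicas are not, and red means at least one primary shard is missing.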

So far, we've had a clear view of Elasticsearch's scale-out and data security-related content. Next, we will continue to discuss more details such as the life cycle of the shards.

II. Distributed Document Storage

2.1 Routing

When you index a document, it is stored on exactly one primary shard. With multiple primary shards, how does ES know which primary shard a given document belongs to? When you create a new document, how does ES decide whether to store it on shard 1 or shard 2? The process cannot be random, because we must be able to find the document again later. The routing algorithm is very simple:

shard = hash(routing) % number_of_primary_shards

The routing value defaults to the document's _id, but it can also be a value supplied by the user. A hash is computed from the routing value and then taken modulo the number of primary shards. This is also why number_of_primary_shards cannot be changed after an index is created: a different shard count would route existing documents to the wrong shards.
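As a hedged sketch (the type name post, the document body, and the routing value user123 are hypothetical), a custom routing value can be supplied at index time and must then be supplied again at read time, so that related documents land on, and are fetched from, the same shard:

PUT /blogs/post/1?routing=user123
{ "title": "My first blog post" }

GET /blogs/post/1?routing=user123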

2.2 Creating, Reading, Updating, and Deleting Documents

Suppose we have the following cluster.

We can send requests to any node in the cluster, and every node is able to handle them. Each node knows where every document lives, so it can route the request to the right node. In the examples that follow, we send all requests to Node 1.

Note: the best practice is to round-robin requests across all nodes to spread the load.

1. Write operations

Creating, indexing, and deleting documents are write operations; they must succeed on the primary shard before being copied to the corresponding replicas. See Figure 9.

Description

1. The client sends a write request to Node 1.
2. Node 1 uses the document's _id to compute the routing; the document belongs to P0 (primary shard 0), so the request is forwarded to Node 3.
3. Node 3 executes the request on P0. If it succeeds, the request is forwarded in parallel to the replicas R0 on Node 1 and Node 2. When all replicas report success, Node 3 reports success to the coordinating node (Node 1), and Node 1 reports success to the client.
4. By the time the client receives the success response, the operation has been applied on the primary shard and on all replica shards.

Of course, there are request parameters that can modify this behavior; see the original chapter for details.
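For example, older releases accept consistency (one, quorum, or all) and timeout on write requests, while newer releases replace consistency with wait_for_active_shards. A hedged sketch, with a hypothetical document:

PUT /blogs/post/2?consistency=quorum&timeout=30s
{ "title": "Distributed CRUD" }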

2. Read operation

Read operation steps:

1. The client sends a read request to Node 1 (whether or not it is the master does not matter).
2. Node 1 uses the document's _id to determine that the document belongs to shard 0. Copies of shard 0 exist on all three nodes; this time it routes the request to Node 2.
3. Node 2 returns the document to Node 1, which returns it to the client. For read requests, the coordinating node (Node 1) picks a different shard copy for each request, round-robining across all replica shards to balance the load.
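A minimal read sketch (index, type, and id are hypothetical). The preference query parameter can override the default round-robin choice of shard copy, for example preferring shard copies on the coordinating node itself:

GET /blogs/post/1
GET /blogs/post/1?preference=_local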

3. Update operations

The update operation combines the two operations above: a read followed by a write. See Figure 11.

Steps:

1. The client sends an update request to Node 1.
2. Node 1 routes the request to Node 3, where the primary shard is located.
3. Node 3 reads the document from P0, modifies the JSON in the _source field, and then tries to re-index the modified document on P0. If the document has been changed by another process in the meantime, step 3 is retried; the request is abandoned once the number of retries exceeds the retry_on_conflict setting.
4. If Node 3 updates the document successfully, it forwards the new version of the document in parallel to the replica shards on Node 1 and Node 2 to be re-indexed. Once all replica shards report success, Node 3 reports success to the coordinating node (Node 1), which reports success to the client.
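A hedged sketch of a partial update using the retry_on_conflict parameter from step 3 (index, type, and the views field are hypothetical):

POST /blogs/post/1/_update?retry_on_conflict=5
{
   "doc": { "views": 0 }
}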

III. Index Principle

3.1 The Per-Segment Mechanism

ES writes its inverted index to disk as an immutable structure: once an inverted index has been built and persisted, it is never updated in place.

So what happens when the index needs to be updated? Trade space for time: updated data is recorded in a new segment.

This mechanism is called a dynamically updatable index. Lucene introduced per-segment search: a segment is a complete, self-contained inverted index over a subset of the documents, so the whole index is split into a series of segments, together with a commit point that lists all known segments.

When a new document is indexed, it is first handled in memory: it is written to an in-memory buffer and later written out to a segment on disk. This synchronization happens once per second by default, which is why ES says that changes become visible with roughly a 1-second delay.

1. Operations are applied in memory first; a periodic synchronization policy then writes them out.
2. Every so often, the buffer is committed: a new segment (an additional inverted index) is written to disk, and a new commit point that includes the name of the new segment is written to disk. The disk is fsynced, so that all data waiting in the kernel's filesystem cache is physically written to disk.
3. The new segment is opened, so the documents it contains become searchable.
4. The in-memory buffer is cleared, ready to accept new documents.

When a search request comes in, every segment is queried in turn, and the results from all segments are aggregated so that document and term statistics, and therefore relevance, stay accurate. In this way, new documents can be made searchable cheaply.

Segments are immutable, so a document can neither be removed from an old segment nor updated in place to point at a new version; instead it is tombstoned. Every commit point includes a .del file listing which documents in which segments have been deleted. When a document is "deleted", it is merely flagged in the .del file: it can still match queries, but it is filtered out before the final results are returned.

Document updates work the same way: when a document is updated, the old version is marked as deleted and the new version is indexed into a new segment. Both versions may match a query, but the old version is filtered out before the final results are returned.

An explicit refresh can be triggered through the API (though relying on manual refreshes online is not recommended):

POST /blogs/_refresh

Although a refresh is much lighter than a full commit, it still has a cost. Manual refreshes are useful when testing writes, but do not refresh after every write in production, as it will hurt performance. Instead, your application needs to accept the near-real-time nature of ES search and tolerate the small delay.
Not every use case needs a refresh every second. Perhaps you are using ES to index millions of log lines and care more about indexing speed than about near-real-time search.
You can reduce the refresh frequency by changing the refresh_interval setting:

PUT /my_logs
{
    "settings": {
        "refresh_interval": "30s"
    }
}

refresh_interval can be updated dynamically on an existing index. For example, you can switch automatic refresh off while bulk-building a large index and switch it back on when you start using the index:

PUT /my_logs/_settings
{ "refresh_interval": -1 }

PUT /my_logs/_settings
{ "refresh_interval": "1s" }

3.2 Persistence mechanism

With the per-segment mechanism described above, a new document becomes searchable within about a minute, but that is still not fast enough. The bottleneck is the disk: committing a new segment to disk requires an fsync to guarantee the physical write, and fsync is expensive. It cannot be called for every document update, or performance would be very poor. We need a lighter-weight way to make new documents searchable, which means leaving fsync out of that path. Between ES and the physical disk sits the kernel's filesystem cache. As in Figures 19 and 20, documents indexed in memory are written to a new segment, but now the segment is first written only to the filesystem cache, which is very cheap, and flushed to disk later, which is expensive. Once a segment file is in the filesystem cache, it can be opened and read.

However, without an fsync to flush the data to disk, we cannot guarantee that the data will survive a power failure or a process crash.

ES needs to be reliable, so it ensures data is persisted to disk by using a transaction log (translog).

A full commit writes the segments to disk and writes a commit point listing all known segments. When ES starts up or re-opens an index, it uses this commit point to decide which segments belong to the shard. But what about documents modified between commits? We do not want to lose those changes:

1. When a document is indexed, it is added to the in-memory buffer and appended to the translog; see Figure 21.

2. A refresh leaves the shard in the state shown in Figure 22. Every second the shard is refreshed:

    • The documents in the in-memory buffer are written to a new segment, but without an fsync.
    • The in-memory buffer is cleared.

3. This process continues: new documents are added to the in-memory buffer and appended to the translog; see Figure 23.

4. Eventually, when the translog grows too large, the index is flushed: a new translog is started and a full commit is performed. See Figure 24:

  • All documents in the in-memory buffer are written to a new segment.
  • The filesystem cache is fsynced to disk.
  • The old translog is deleted.

In other words, persistence is achieved through the combination of the filesystem cache and the transaction log; the commit-and-truncate step is called a flush.

In ES, the operation that performs a full commit and truncates the transaction log is called a flush. A shard is flushed every 30 minutes, or when the transaction log grows too large. The flush API can be used to trigger a flush manually; to flush the blogs index:

POST /blogs/_flush

To flush all indices and wait until the operation finishes before returning:

POST /_flush?wait_for_ongoing

Manual flushes are rarely required; the automatic flushing is usually enough. Flushing an index is useful before restarting a node or closing an index: when ES recovers or reopens an index it must replay all the operations in the translog, so the smaller the log, the faster the recovery.
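For reference, later versions expose index settings that tune this flush/translog behavior; a hedged sketch (the setting names follow the translog documentation, but availability and whether they can be changed dynamically vary by version):

PUT /my_index/_settings
{
   "index.translog.durability": "async",
   "index.translog.flush_threshold_size": "512mb"
}

With durability set to async, the translog itself is fsynced only periodically instead of on every request, trading a small window of potential data loss for higher write throughput.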

3.3 Segment Merging

With a new segment created by every automatic refresh (once per second), the number of segments quickly explodes. Too many segments is a problem: each segment consumes file handles, memory, and CPU, and, more importantly, every search request has to examine every segment in turn, so the more segments there are, the slower the search.
ES solves this by merging segments in the background: small segments are merged into larger segments, which are in turn merged into still larger ones, and deleted documents are purged along the way. You do not have to do anything; ES handles this automatically while you index and search. During indexing, refresh keeps creating and opening new segments.
The merge process picks a few small segments in the background and merges them into a larger segment, without interrupting indexing or searching.
The merge operation works roughly as follows:
1. The new, merged segment is flushed to disk.
2. A new commit point is written that includes the new segment and excludes the old ones.
3. The new segment is opened for search.
4. The old segments are deleted.
Merging large segments consumes a lot of I/O and CPU and can hurt search performance if left unchecked. By default, ES throttles the merge process so that searches still have enough resources.
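For an index that is no longer being written to (an old log index, for example), the segments can be merged down manually. A hedged sketch using the older _optimize endpoint (renamed _forcemerge in later versions); the index name is hypothetical, and this should not be run against an index that is still receiving writes:

POST /old_logs/_optimize?max_num_segments=1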

IV. Summary

All of the operations above are transparent to the user:

1. ES natively supports horizontal scaling and failover, but note that the number of primary shards cannot be changed after an index is created. In a production environment, plan for re-indexing; index aliases can later be used to perform a hot index rebuild (see the sketch after this list).

2. ES splits the inverted index into segments. The inverted index is immutable, so when new documents need to be indexed they go into new segments rather than modifying old ones, which avoids concurrency problems: a typical space-for-time tradeoff.

3. Operations are buffered in memory first and then periodically synchronized into segments; the default refresh interval is 1s.

4. Segment persistence is implemented with the filesystem cache plus the transaction log; the default flush interval is 30 minutes.
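A hedged sketch of the alias-based hot rebuild mentioned in point 1 (index and alias names are hypothetical): the application always reads and writes through the alias blogs, so once documents have been re-indexed into blogs_v2, swapping the alias away from blogs_v1 is a single atomic step:

POST /_aliases
{
   "actions": [
      { "remove": { "index": "blogs_v1", "alias": "blogs" } },
      { "add":    { "index": "blogs_v2", "alias": "blogs" } }
   ]
}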

