Anatomy of an Elasticsearch Cluster, Part II: The Three C's of Distributed Systems, the Translog, and Lucene Segments


Reprint: http://www.infoq.com/cn/articles/anatomy-of-an-elasticsearch-cluster-part02

Consensus: split-brain problems and the importance of quorum

Consensus is a fundamental challenge for distributed systems: it requires that all processes/nodes in the system agree on the value/state of a given piece of data. There are many well-proven consensus algorithms, such as Raft and Paxos. However, Elasticsearch implements its own consensus system (Zen Discovery), and Elasticsearch's creator Shay Banon explains why in this article. The Zen Discovery module has two parts:

    • Ping: the process by which nodes discover each other
    • Unicast: a module containing a list of hostnames that controls which nodes should be pinged

Elasticsearch is a peer-to-peer system in which all nodes communicate with each other, and there is one active master node that updates and controls the cluster-wide state and operations. A new Elasticsearch cluster is established through an election that takes place as part of the ping process: one node is elected master out of all the master-eligible nodes, and the other nodes then join the master. The default value of the ping interval parameter ping_interval is 1 second, and the default value of the ping timeout parameter ping_timeout is 3 seconds. As nodes join, they send a join request to the master with a join_timeout that defaults to 20 times the ping_timeout. If the master runs into trouble, the other nodes in the cluster start pinging again to trigger another election. This ping process also helps a node that suddenly loses contact with the master to rediscover it through the other nodes.

Note: By default, client nodes and data nodes do not participate in the election process. This default behavior can be changed by setting the discovery.zen.master_election.filter_client and discovery.zen.master_election.filter_data properties to false in the elasticsearch.yml configuration file.
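As a rough sketch, the Zen Discovery settings mentioned above all live in elasticsearch.yml. The hostnames below are placeholders, and the setting names are the pre-5.x Zen Discovery names this article refers to:

    discovery.zen.ping.unicast.hosts: ["es-node-1", "es-node-2", "es-node-3"]   # placeholder hostnames
    discovery.zen.ping_timeout: 3s        # master-election ping timeout (3s is the default)
    discovery.zen.join_timeout: 60s       # join request timeout; the default is 20 times ping_timeout
    discovery.zen.master_election.filter_client: false   # count pings from client nodes during elections
    discovery.zen.master_election.filter_data: false     # count pings from data-only nodes during elections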

Fault detection works in both directions: the master node pings all other nodes to check whether they are still alive, and all other nodes ping the master to confirm that it is still alive.
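These fault-detection pings have their own settings in elasticsearch.yml; a brief sketch, with values that are, to the best of my knowledge, the usual defaults:

    discovery.zen.fd.ping_interval: 1s    # how often nodes ping each other for fault detection
    discovery.zen.fd.ping_timeout: 30s    # how long to wait for a ping response
    discovery.zen.fd.ping_retries: 3      # failed pings tolerated before a node is considered dead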

With the default settings, Elasticsearch can suffer from the split-brain problem. In the event of a network partition, a node may decide that the master is dead and elect itself as master, resulting in a cluster with multiple masters. This can lead to data loss, or it may become impossible to merge the data correctly. To avoid this, set the quorum property according to the following formula, based on the number of master-eligible nodes:

discovery.zen.minimum_master_nodes = int(# of master eligible nodes/2)+1
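For example, with 3 master-eligible nodes the quorum is int(3/2) + 1 = 2, so elasticsearch.yml would contain:

    discovery.zen.minimum_master_nodes: 2   # a master needs at least 2 master-eligible nodes visible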

This property requires that a quorum of master-eligible nodes join the newly elected master in order for the election to complete and for the new master to accept its role. The property is extremely important for cluster stability, and it must be updated dynamically whenever the cluster size changes. Figures a and b illustrate what happens during a network partition when minimum_master_nodes is set and when it is not.

Note: For a production cluster, it is recommended to run 3 dedicated master nodes that do not serve any client requests; only 1 of them is active at any given time.
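A dedicated master node is one with the master role enabled and the data role disabled. A minimal elasticsearch.yml sketch for such a node, using the node settings of the Elasticsearch versions this article is based on:

    node.master: true    # eligible to be elected master
    node.data: false     # holds no shards and serves no data requests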

Now that we have seen how Elasticsearch handles consensus, let's look at how it handles concurrency.

Concurrency

Elasticsearch is a distributed system that supports concurrent requests. When a create/update/delete request reaches the primary shard, it is also sent to the shard replicas in parallel. However, these requests may arrive out of order. In such cases, Elasticsearch uses optimistic concurrency control to ensure that newer versions of a document are not overwritten by older ones.

Every document that is indexed has a version number, which is incremented each time a change is applied to the document. These version numbers are used to ensure that changes are applied in order. To make sure that updates in our application do not lead to data loss, the Elasticsearch API lets us specify the current version number of a document so that a change is accepted only if that version is still current. If the version specified in the request is older than the one on the shard, the request fails, which means the document has been updated by another process in the meantime. How failed requests are handled can be controlled at the application level. Elasticsearch also offers additional locking options, which you can read about in this article.
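As a sketch of this optimistic concurrency control, the index API accepts a version parameter; the index, type, and document below are hypothetical. The write succeeds only if the document's current version is still 2, otherwise Elasticsearch returns a version-conflict error:

    curl -XPUT 'localhost:9200/my_index/my_type/1?version=2' -d '{"title": "updated title"}'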

The next question is: when we send concurrent requests to Elasticsearch, how do we keep reads and writes consistent? It is not clear which side of the CAP triangle Elasticsearch falls on, and I am not going to settle that long-running debate in this article.

However, let's look at how consistent reads and writes can be achieved with Elasticsearch.

Consistency: ensuring consistent reads and writes

For write operations, Elasticsearch supports a consistency level that, unlike in most other databases, acts as a preliminary check of how many shards must be available before a write is allowed. The possible values are quorum, one, and all. The default is quorum, which means a write operation is allowed only when a majority of the shards are available. Even when a majority of the shards are available, a write to a replica can still fail for some reason; in that case the replica is considered faulty and the shard is rebuilt on a different node.
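In the Elasticsearch versions this article is based on, the consistency level can also be set per write request. A sketch against a hypothetical index (quorum is the default, so this merely makes the behavior explicit):

    curl -XPUT 'localhost:9200/my_index/my_type/1?consistency=quorum' -d '{"title": "a document"}'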

For read operations, newly indexed documents only become searchable after the refresh interval has elapsed. To ensure that a search request returns results from the latest version of a document, replication can be set to sync (the default), which makes a write request return only after the operation has completed on both the primary shard and the replicas. In that case, a search request will return the latest version of the document from any shard. Even if our application sets replication=async for a higher indexing rate, we can set the preference parameter of the search request to _primary. The search request is then executed on the primary shards, which guarantees that the documents in the results are the latest versions.
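A sketch of such a search request against a hypothetical index; in the versions this article describes, the parameter is preference and the value for primary shards is _primary:

    curl -XGET 'localhost:9200/my_index/_search?preference=_primary' -d '{"query": {"match_all": {}}}'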

We have seen how Elasticsearch handles consensus, concurrency, and consistency; now let's look at some of the key concepts inside a shard that make Elasticsearch a distributed search engine.

Translog (write-ahead log)

Driven by the development of relational databases, the concept of a write-ahead log (WAL), or transaction log (translog), has long been pervasive in the database world. In the event of a failure, the translog ensures data integrity, based on the principle that intended changes must be logged and committed before the actual changes are committed to disk.

When a new document is indexed or an old document is updated, the Lucene index changes, and those changes are committed to disk for persistence. Committing after every write request would be very expensive, so commits are performed in a way that persists multiple changes to disk at once. As described in the previous article, a Lucene commit is carried out by the flush operation, which by default runs every 30 minutes or when the translog becomes too large (512MB by default). With only that, all changes made between two Lucene commits could be lost. To avoid this, Elasticsearch uses the translog: all index/delete/update operations are written to the translog, and the translog is fsynced after every index/delete/update operation (or every 5 seconds by default) to make the changes durable. The client receives an acknowledgement for a write request only after the translog has been fsynced on both the primary shard and the replicas.
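The translog sync interval and durability behavior are per-index settings. A hedged sketch that sets them when creating a hypothetical index: request durability fsyncs the translog after every request (the behavior described above), while async durability fsyncs it only once per sync_interval:

    curl -XPUT 'localhost:9200/my_index' -d '{
      "settings": {
        "index.translog.durability": "request",
        "index.translog.sync_interval": "5s"
      }
    }'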

If a hardware failure occurs between two Lucene commits, the translog can be replayed to recover any changes lost since the last Lucene commit, and all changes will be applied to the index.

Note: It is recommended to explicitly flush the translog before restarting an Elasticsearch instance; startup is then faster because there is no translog left to replay. The POST /_all/_flush command can be used to flush all indices in the cluster.
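For reference, the same flush as a curl call:

    curl -XPOST 'localhost:9200/_all/_flush'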

When the translog is flushed, the segments in the filesystem cache are committed to disk, making the changes in the index persistent. Now let's take a look at Lucene segments.

Lucene segments

A Lucene index is made up of multiple segments, and each segment is itself a fully functional inverted index. Segments are immutable, which allows Lucene to add new documents to the index incrementally without rebuilding the index from scratch. Every search request searches all the segments in an index, and each segment consumes CPU cycles, file handles, and memory. This means that the more segments there are, the lower the search performance.
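The segments backing each shard can be inspected directly, which is a handy way to observe this; a sketch using the cat segments API against a hypothetical index:

    curl -XGET 'localhost:9200/_cat/segments/my_index?v'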

To solve this problem, Elasticsearch merges small segments into a larger segment (as shown in the figure), commits the new merged segment to disk, and deletes the old small segments.

This happens automatically in the background without interrupting indexing or searching. Because segment merging can exhaust resources and hurt search performance, Elasticsearch throttles the merge process so that enough resources remain available for search.
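Merging can also be triggered explicitly, for example on an index that is no longer being written to. A sketch using the force merge API (called optimize in older releases), with a hypothetical index name:

    curl -XPOST 'localhost:9200/my_index/_forcemerge?max_num_segments=1'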
