Introduction to ETCD non-relational database

1. What is the ETCD service?

ETCD is a key/value pair storage system exposed over HTTP: a distributed, hierarchical configuration system that can be used to build service discovery systems. It is a distributed, consistent KV store for shared configuration and service discovery. It is easy to deploy, install, and use, provides reliable data persistence, is secure, and has complete documentation.

The latest stable version of ETCD at the time of writing is 3.3.9; for details, refer to the [Project home] and [Github]. ETCD is an open-source project initiated by CoreOS and released under the Apache license.

There is more than one system that provides configuration sharing and service discovery, the best known of which is ZooKeeper; ETCD can be regarded as a rising star. In terms of project implementation, comprehensibility of the consensus protocol, operability, security, and other dimensions, ETCD has the advantage over ZooKeeper.

2. Differences between ZooKeeper and ETCD:

1) Consensus protocol: ETCD uses the Raft protocol, while ZooKeeper uses ZAB (a Paxos-like protocol). Raft is easier to understand and easier to implement in an engineering project;

2) Operations: ETCD is convenient to operate and maintain, while ZK is difficult to run in production;

3) Project activity: the ETCD community and its development are active, while ZK development has largely stagnated;

4) API: ETCD provides HTTP+JSON and gRPC interfaces, which are cross-platform and cross-language; ZK requires its own client library;

5) Access security: ETCD supports HTTPS access, which ZK lacks.

3. ETCD application scenarios:

Configuration management, service registration and discovery, leader election, application scheduling, distributed queues, and distributed locks. A sketch of a distributed lock built on etcd follows.
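For example, a distributed lock can be built with the v3 Go client's concurrency package. This is a minimal sketch, not a production recipe: the endpoint address and the "/my-lock/" prefix are placeholders, and the import path varies by etcd release.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	// Connect to the cluster (endpoint is a placeholder).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session holds a lease that is kept alive in the background;
	// if this process dies, the lock is released when the lease expires.
	sess, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	// "/my-lock/" is an arbitrary prefix shared by all competing clients.
	mutex := concurrency.NewMutex(sess, "/my-lock/")

	ctx := context.Background()
	if err := mutex.Lock(ctx); err != nil {
		log.Fatal(err)
	}
	log.Println("lock acquired, doing work...")

	if err := mutex.Unlock(ctx); err != nil {
		log.Fatal(err)
	}
}
```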

4. ETCD mainly provides the following capabilities (all four are illustrated in the sketch after this list):

1) An interface for storing and retrieving data, with strong consistency across the nodes of an ETCD cluster guaranteed by the Raft protocol. Used to store meta information and shared configuration.

2) A watch mechanism: a client can watch a key, or a set of keys, for changes (v2 and v3 use different mechanisms, discussed below). Used for monitoring and pushing configuration changes.

3) An expiration and renewal mechanism for keys: a client renews a key by refreshing it periodically (again, the v2 and v3 implementations differ). Used for cluster health monitoring and service registration/discovery.

4) Atomic CAS (compare-and-swap) and CAD (compare-and-delete) support (via interface parameters in v2, via batch transactions in v3). Used for distributed locks and leader election.
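A minimal sketch of these four capabilities using the v3 Go client. The endpoint and key names are placeholders, and the import path differs across etcd releases.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// 1) Store and retrieve data with strong consistency.
	if _, err := cli.Put(ctx, "/config/db", "mysql://10.0.0.1"); err != nil {
		log.Fatal(err)
	}
	resp, _ := cli.Get(ctx, "/config/db")
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}

	// 2) Watch a key for changes (runs until ctx is cancelled).
	go func() {
		for wresp := range cli.Watch(ctx, "/config/db") {
			for _, ev := range wresp.Events {
				fmt.Printf("event %s on %s\n", ev.Type, ev.Kv.Key)
			}
		}
	}()

	// 3) Expiration via a lease: the key vanishes unless the lease is renewed.
	lease, _ := cli.Grant(ctx, 10) // TTL in seconds
	cli.Put(ctx, "/services/api/instance1", "10.0.0.2:8080", clientv3.WithLease(lease.ID))
	cli.KeepAliveOnce(ctx, lease.ID) // one renewal; KeepAlive renews continuously

	// 4) Atomic compare-and-swap via a transaction.
	txn, _ := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.Value("/config/db"), "=", "mysql://10.0.0.1")).
		Then(clientv3.OpPut("/config/db", "mysql://10.0.0.2")).
		Commit()
	fmt.Println("CAS succeeded:", txn.Succeeded)
}
```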

5. How ETCD works:

ETCD uses the Raft protocol to keep the state of the nodes in the cluster consistent. Simply put, an ETCD cluster is a distributed system consisting of multiple nodes that communicate with one another to form a whole that provides service externally; each node stores the complete data, and the Raft protocol ensures that the data maintained by every node is consistent.

(Figure: ETCD working principle diagram)

Each ETCD node maintains a state machine, and at most one valid master (leader) node exists at any given time. The leader handles all write operations from clients and, through the Raft protocol, guarantees that writes to the state machine are reliably replicated to the other nodes.

6. Number of cluster nodes

ETCD uses the Raft protocol to ensure that the state of all nodes is consistent. By the principles of the Raft algorithm, the more nodes there are, the lower the cluster's write performance. This is because every write operation requires a majority of the nodes in the cluster to successfully write the log entry before the leader can apply the change to its internal state machine and return the result to the client.

That is, with the same configuration, the fewer the nodes, the better the cluster's performance. Obviously, deploying only 1 node makes no sense. Typically, clusters are deployed with 3, 5, 7, or 9 nodes as required.

Can an even number of nodes be chosen? It is better not to, for two reasons:

1) A cluster with an even number of nodes has a higher risk of unavailability: during leader election there is a greater probability of a tied vote, which triggers another round of elections.

2) A cluster with an even number of nodes cannot work properly in some network-partition scenarios. Imagine a partition that splits the cluster exactly in half, for example a 4-node cluster split 2-2. Under the Raft protocol a write must be agreed to by a majority of nodes, and neither half can reach a majority, so writes fail and the cluster cannot function properly.

7. Node migration

In a production environment, machine hardware failures are unavoidable. When a hardware failure occurs, the node needs to be recovered quickly. An ETCD cluster can migrate a node without losing data and without changing the node ID.

The specific approach is:

1) Stop the etcd process on the node to be migrated;

2) Copy the data directory to the new machine;

3) Update the peer URL of this member in the cluster to point to the new machine (a sketch of this step follows the list);

4) Start the etcd process on the new machine with the same configuration.
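Step 3 can be performed with etcdctl or, as in this minimal sketch, through the v3 Go client's cluster API. The member name "node2" and the URL here are placeholders; the real member ID comes from a member list query.

```go
// Sketch of step 3: repoint the migrated member's peer URL.
// cli is a *clientv3.Client connected as in the earlier sketches;
// the member name and the URL below are placeholder values.
func updatePeerURL(ctx context.Context, cli *clientv3.Client) error {
	members, err := cli.MemberList(ctx)
	if err != nil {
		return err
	}
	for _, m := range members.Members {
		if m.Name == "node2" { // placeholder: the member being migrated
			_, err = cli.MemberUpdate(ctx, m.ID, []string{"http://10.0.1.9:2380"})
			return err
		}
	}
	return fmt.Errorf("member not found")
}
```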

8. Differences between ETCD v2 and v3

ETCD v2 and v3 are essentially two separate applications that share the same Raft protocol code: their interfaces differ, their storage differs, and their data are isolated from each other. That is, if you upgrade from etcd v2 to etcd v3, data created through the v2 interface can only be accessed through the v2 interface, and data created through the v3 interface can only be accessed through the v3 interface. We therefore analyze v2 and v3 separately:

1) ETCD v2 storage, watch and expiration mechanism

(Figure: ETCD v2 storage structure diagram)

ETCD v2 is a pure in-memory implementation and does not write data to disk in real time. The persistence mechanism is simple: the store is serialized into JSON and written to a file. In memory, the data is a simple tree structure.

When a client calls the watch interface (by adding the wait parameter to the request): if the request carries a waitIndex, and that waitIndex is less than currentIndex, the EventHistory table is queried for events with index less than or equal to waitIndex that match the watched key; if such events exist, they are returned directly. If the history table has no match, or the request carries no waitIndex, the watcher is placed in the WatchHub, where each key is associated with a list of watchers. When a change operation occurs, the event generated by the change is appended to the EventHistory table and the watchers associated with that key are notified.

A few details here affect usage in practice:

1) EventHistory has a limited length, at most 1000 events. In other words, if your client is disconnected for a long time and then re-watches, the events corresponding to its waitIndex may already have been evicted, in which case the changes are lost.

2) If notifying a watcher blocks (each watcher channel has a buffer of 100), ETCD deletes the watcher directly, which disconnects the wait request; the client then needs to reconnect.

3) The expiration time is stored in each node of the ETCD store and is cleaned up by a periodic mechanism.

From this, some limitations of ETCD v2 can be seen:

1) The expiration time can only be set per key; keeping the life cycles of multiple keys consistent is difficult.

2) A watch can only cover one key and its child nodes (via the recursive parameter); it cannot watch multiple keys at once.

3) It is difficult to achieve complete data synchronization through the watch mechanism alone (changes can be lost), so most current usage is to learn of changes through watch and then retrieve the data through get, rather than relying entirely on watch change events. A sketch of this pattern follows.
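The watch-then-get pattern looks like the following sketch, shown with the v3 Go client for brevity (the v2 HTTP API differs in detail, but the pattern is the same); the "/config/" prefix is a placeholder.

```go
// Watch as a change signal only; re-read the full state on every event,
// so a lost event cannot leave the local copy permanently stale.
// cli is a *clientv3.Client connected as in the earlier sketches.
func syncConfig(ctx context.Context, cli *clientv3.Client) {
	for range cli.Watch(ctx, "/config/", clientv3.WithPrefix()) {
		// Something under /config/ changed: fetch the whole subtree again.
		resp, err := cli.Get(ctx, "/config/", clientv3.WithPrefix())
		if err != nil {
			continue // transient error: the next event triggers another retry
		}
		for _, kv := range resp.Kvs {
			fmt.Printf("%s = %s\n", kv.Key, kv.Value)
		}
	}
}
```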


2) ETCD v3 storage, watch and expiration mechanism

(Figure: ETCD v3 storage structure diagram)

ETCD v3 separates the watch and the store; we first analyze the store implementation.

The ETCD v3 store is divided into two parts. One is an in-memory index, kvindex, based on Google's open-source golang btree. The other is back-end storage. By design, the backend can plug in multiple stores; currently BoltDB is used. BoltDB is a single-machine KV store that supports transactions, and ETCD's transactions are implemented on top of BoltDB transactions. The key that ETCD stores in BoltDB is the reversion, and the value is ETCD's own key-value combination; that is, ETCD saves every version in BoltDB, thereby implementing a multi-version mechanism.

A reversion consists of two parts: the main rev, incremented once per transaction, and the sub rev, incremented once per operation within the same transaction. For example, if the main rev of the first operation is 3, that of the second is 4. The first problem this mechanism raises is, of course, space, so ETCD provides commands and startup options to control compaction, and the put operation supports parameters to precisely control the number of historical versions kept for a key.

With this picture of ETCD's disk storage in mind, it is clear that querying data from BoltDB requires a reversion, while clients query values by key; so ETCD's in-memory kvindex stores the mapping from key to reversion, which speeds up queries.
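The multi-version mechanism is observable from the client side: every put bumps the store revision, and an old value can still be read back by revision. A minimal sketch, assuming the key name is a placeholder and the old revision has not been compacted away:

```go
// Each put creates a new revision; WithRev reads the value as of an
// older revision. cli is a *clientv3.Client as in the earlier sketches.
func showMVCC(ctx context.Context, cli *clientv3.Client) {
	first, _ := cli.Put(ctx, "/demo", "v1")
	cli.Put(ctx, "/demo", "v2")

	// Read the key as it was at the revision of the first put.
	old, _ := cli.Get(ctx, "/demo", clientv3.WithRev(first.Header.Revision))
	cur, _ := cli.Get(ctx, "/demo")
	fmt.Printf("old=%s current=%s\n", old.Kvs[0].Value, cur.Kvs[0].Value)
}
```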

Next we analyze the implementation of the watch mechanism. ETCD v3's watch supports watching a fixed key and also watching a range (which can be used to simulate watching a directory structure), so a watchGroup contains two kinds of watchers: key watchers, whose data structure maps each key to a set of watchers, and range watchers, whose data structure is an interval tree, which makes it easy to find the watchers corresponding to a given interval.

At the same time, each watchableStore contains two watcherGroups: one synced, the other unsynced. The former indicates that the group's watchers have caught up with the data and are waiting for new changes; the latter indicates that the group's watchers lag behind the latest changes and are still catching up.

When ETCD receives a watch request from a client, if the request carries a revision parameter, the requested revision is compared with the store's current revision: if it is greater than the current revision, the watcher is placed in the synced group; otherwise it is placed in the unsynced group. At the same time, ETCD runs a background goroutine that continuously synchronizes the unsynced watchers and then migrates them to the synced group. With this design, ETCD v3 supports watching from any historical version, with no v2-style limit of a 1000-event history table (assuming, of course, that the events have not been compacted). A sketch follows.
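A minimal sketch of watching from a historical revision with the v3 Go client: past events from that revision are replayed first, then new ones stream in (the prefix is a placeholder, and the revision must not have been compacted):

```go
// Start watching from an old revision: etcd first replays the historical
// events from that revision, then streams new ones as they happen.
// cli is a *clientv3.Client as in the earlier sketches.
func watchFromRevision(ctx context.Context, cli *clientv3.Client, rev int64) {
	for wresp := range cli.Watch(ctx, "/config/", clientv3.WithPrefix(), clientv3.WithRev(rev)) {
		for _, ev := range wresp.Events {
			fmt.Printf("rev=%d %s %s\n", ev.Kv.ModRevision, ev.Type, ev.Kv.Key)
		}
	}
}
```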

In addition, as mentioned earlier, when ETCD v2 notifies a client and the network is bad or the client reads slowly so that the notification blocks, the connection is closed directly and the client must re-initiate the request. To solve this, ETCD v3 maintains a dedicated queue of watchers whose pushes are blocked and retries them in a separate goroutine.

ETCD v3 has also improved the expiration mechanism: the expiration time is set on a lease, and keys are then associated with the lease. This allows multiple keys to share the same lease ID, giving them a uniform expiration time and enabling batch renewal, as in the sketch below.
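A minimal sketch of binding several keys to one lease so that they expire together and are renewed in one batch (the key names and TTL are placeholders):

```go
// One lease, several keys: all of them expire together when the lease
// does, and one keep-alive stream renews them all at once.
// cli is a *clientv3.Client as in the earlier sketches.
func registerWithLease(ctx context.Context, cli *clientv3.Client) error {
	lease, err := cli.Grant(ctx, 15) // 15-second TTL (placeholder value)
	if err != nil {
		return err
	}
	cli.Put(ctx, "/services/api/addr", "10.0.0.2:8080", clientv3.WithLease(lease.ID))
	cli.Put(ctx, "/services/api/meta", `{"version":"1.0"}`, clientv3.WithLease(lease.ID))

	// KeepAlive renews the single lease in the background, which
	// renews every key attached to it in one shot.
	_, err = cli.KeepAlive(ctx, lease.ID)
	return err
}
```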


9. Major changes in ETCD v3 compared with ETCD v2:

1) The interface is exposed as RPC over gRPC, and the v2 HTTP interface is abandoned. The advantage is a clear efficiency gain from long-lived connections; the disadvantage is that it is less convenient to use than before, especially in scenarios where maintaining long connections is awkward.

2) The original directory structure is abandoned in favor of pure KV; users can simulate directories through prefix matching (see the sketch after this list).

3) Values are no longer held in memory, so the same amount of memory can support storing more keys.

4) The watch mechanism is more stable; full data synchronization can essentially be achieved through the watch mechanism alone.

5) Bulk operations and a transaction mechanism are provided; users can implement ETCD v2's CAS semantics through transactions (batch transactions support if-condition judgments).
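As an illustration of item 2, directories can be simulated with key prefixes. A minimal sketch (the "/app/nodes/" naming is arbitrary):

```go
// Pure KV with prefix matching in place of v2 directories:
// list "children" by sharing a key prefix.
// cli is a *clientv3.Client as in the earlier sketches.
func listDirectory(ctx context.Context, cli *clientv3.Client) {
	cli.Put(ctx, "/app/nodes/n1", "10.0.0.1")
	cli.Put(ctx, "/app/nodes/n2", "10.0.0.2")

	// Acts like "ls /app/nodes": every key under the prefix.
	resp, _ := cli.Get(ctx, "/app/nodes/", clientv3.WithPrefix())
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```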


10. ETCD usage precautions

1) ETCD cluster initialization

If a node is not started when the cluster is first initialized, accessing the cluster through the v3 interface reports the error "etcdserver: not capable". For compatibility reasons, the default API version at cluster startup is 2.3; only when all nodes of the cluster have joined and are confirmed to support the v3 interface is the cluster version promoted to v3. This happens only when the cluster is initialized for the first time. If a node goes down after the cluster has been initialized, or the cluster is shut down and restarted (on restart the cluster API version is loaded from the persisted data), it is not affected.

2) ETCD read request mechanism

In v2, when quorum=true, reads go through Raft; requests issued through the CLI default to quorum=true.

In v3, when --consistency="l" (the default), reads go through Raft and are linearizable; otherwise the node's local data is read (serializable). In the SDK this is controlled by the WithSerializable option.

With consistent (linearizable) reads, every read also goes through the Raft protocol once. This guarantees consistency at a cost in performance, and when a network partition occurs, the minority side of the cluster cannot serve consistent reads. If the parameter is not set, reads are served directly from the local store, and consistency is lost. When using etcd, set this parameter according to the scenario and weigh consistency against availability; the sketch below shows both modes.
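In the Go client the same trade-off is expressed per request. A minimal sketch (the key is a placeholder):

```go
// Linearizable read (the default): goes through Raft, consistent but slower.
// Serializable read: served from the local node's store, faster but may be stale.
// cli is a *clientv3.Client as in the earlier sketches.
func readTradeoff(ctx context.Context, cli *clientv3.Client) {
	consistent, _ := cli.Get(ctx, "/config/db") // linearizable by default
	local, _ := cli.Get(ctx, "/config/db", clientv3.WithSerializable())
	fmt.Println(len(consistent.Kvs), len(local.Kvs))
}
```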

3) ETCD compaction mechanism

By default, etcd does not compact automatically; compaction must be enabled via startup parameters or triggered with a command. If changes are frequent, configuring compaction is recommended, since otherwise space and memory are wasted and errors eventually occur. ETCD v3 has a default backend quota of 2GB; if the BoltDB file grows beyond this limit without compaction, the error "etcdserver: mvcc: database space exceeded" is raised and no more data can be written.
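Besides the startup parameter (the --auto-compaction-retention flag), compaction can be triggered from a client. A minimal sketch that compacts all history older than the current revision:

```go
// Manually compact the MVCC history up to the current revision,
// reclaiming space held by old versions. cli is a *clientv3.Client
// as in the earlier sketches.
func compactNow(ctx context.Context, cli *clientv3.Client) error {
	// Any recent response header carries the current store revision;
	// the key queried here is irrelevant.
	resp, err := cli.Get(ctx, "/")
	if err != nil {
		return err
	}
	_, err = cli.Compact(ctx, resp.Header.Revision)
	return err
}
```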
