etcd Architecture and Implementation Analysis


Some time ago a project of mine used etcd, so I studied its source code and implementation. There are many articles online about how to use etcd, but few that analyze its concrete implementation, and documentation for etcd v3 is scarce. This article analyzes etcd's architecture and implementation to understand its strengths, weaknesses, and bottlenecks. On one hand, it is a chance to learn distributed-system architecture; on the other, it helps ensure etcd is used correctly in real systems: knowing not just what it does but why, so as to avoid misuse. Finally, I introduce the tools around etcd and some precautions for use.

Intended audience: distributed-systems enthusiasts, and developers who are using or planning to use etcd in their projects.

According to the official introduction:

etcd is a distributed, consistent key-value store for shared configuration and service discovery

That is, etcd is a distributed, consistent key-value store used primarily for shared configuration and service discovery. etcd is already widely used in many distributed systems. This article's analysis of its architecture and implementation mainly answers the following questions:

    1. How does etcd achieve consistency?
    2. How is etcd's storage implemented?
    3. How is etcd's watch mechanism implemented?
    4. How is etcd's key expiration mechanism implemented?

Why do we need etcd?

All distributed systems face the problem of sharing data among multiple nodes. It is like teamwork: members can work separately, but they always need to share some essential information, such as who the leader is, who the members are, and how tasks depend on and coordinate with each other. So a distributed system either implements its own reliable shared storage to synchronize this information (as Elasticsearch does) or relies on an external reliable shared-storage service, and etcd is one such service.

What capabilities does etcd offer?

etcd mainly provides the following capabilities; readers already familiar with etcd can skip this section.

    1. An interface for storing and retrieving data, with strong consistency guaranteed across the nodes of an etcd cluster by its consensus protocol. Used for storing meta information and shared configuration.
    2. A watch mechanism: a client can watch a key or a set of keys for changes (v2 and v3 use different mechanisms, discussed below). Used for change notification and distribution.
    3. A key expiration and renewal mechanism: clients renew keys through periodic refreshes (v2 and v3 implement this differently). Used for cluster liveness monitoring and service registration/discovery.
    4. Atomic CAS (compare-and-swap) and CAD (compare-and-delete) support (v2 via interface parameters, v3 via batch transactions). Used for distributed locks and leader election.

More detailed usage scenarios are not covered here; interested readers can refer to the InfoQ article linked at the end.

How does etcd achieve consistency?

To answer this, we have to talk about the Raft protocol. This article is not a dedicated analysis of Raft, and space does not allow a detailed one; interested readers should see the original paper and the Raft animation linked at the end. To make the rest of the article easier to follow, here is a brief summary:

    1. Raft designs separate mechanisms for separate concerns (leader election, log replication). Although this reduces generality (relative to Paxos), it also reduces complexity, making Raft easier to understand and implement.
    2. Raft's leader election is built into the protocol itself. The key to understanding it is to understand Raft's term (logical clock) and timeout mechanisms.
    3. The key to understanding etcd's data synchronization is to understand Raft's log replication mechanism.

etcd's Raft implementation takes full advantage of Go's CSP concurrency model and channels; to go further, read the source code. Here I only briefly analyze its WAL (write-ahead log).

The WAL is binary; parsed, each record is a LogEntry data structure. Its first field is the type, which has only two values: 0 means Normal and 1 means ConfChange (a ConfChange records a change to etcd's own configuration, such as a new node joining). The second field is the term: each term represents the tenure of one leader, and the term changes whenever the leader changes. The third field is the index, a strictly increasing change sequence number. The fourth field is binary data holding the protobuf-encoded Raft request object. The etcd source ships a tool, tools/etcd-dump-logs, that can dump the WAL into text for viewing, which helps when analyzing the Raft protocol.
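To make the record layout concrete, here is a minimal Go sketch of the parsed WAL entry described above. The field and type names mirror the text; the real definitions live in etcd's raftpb/walpb protobuf packages, so treat this as an illustration, not etcd's actual types.

```go
package main

import "fmt"

// EntryType distinguishes normal application data from cluster
// configuration changes (illustrative; mirrors the two values in the text).
type EntryType int

const (
	EntryNormal     EntryType = 0 // ordinary client request
	EntryConfChange EntryType = 1 // etcd's own configuration change, e.g. a node joining
)

// LogEntry is a simplified sketch of one parsed WAL record.
type LogEntry struct {
	Type  EntryType // 0 = Normal, 1 = ConfChange
	Term  uint64    // leader term; changes whenever the leader changes
	Index uint64    // strictly increasing change sequence number
	Data  []byte    // protobuf-encoded Raft request payload
}

func main() {
	e := LogEntry{Type: EntryNormal, Term: 2, Index: 17, Data: []byte("put foo bar")}
	fmt.Printf("type=%d term=%d index=%d len(data)=%d\n", e.Type, e.Term, e.Index, len(e.Data))
}
```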

The Raft protocol itself does not care about application data, which lives in the data field; consistency is achieved by replicating the WAL. Each node applies the data received from the leader to its local store. Raft only cares about the synchronization state of the log; if the local store's apply logic has a bug — for example, it fails to apply the data correctly — the stores can still end up inconsistent.

etcd v2 and v3

etcd v2 and v3 are essentially two separate applications that share the same Raft protocol code. They have different interfaces and different storage, and their data is isolated from each other. That is, if you upgrade from etcd v2 to etcd v3, the original v2 data can only be accessed through the v2 interface, and data created through the v3 interface can only be accessed through the v3 interface. We therefore analyze v2 and v3 separately.

etcd v2: storage, watch, and expiration

etcd v2 is a pure in-memory implementation and does not write data to disk in real time. Its persistence mechanism is simple: it serializes the store to JSON and writes it to a file. In memory, the data is a simple tree structure. For example, the following data is stored in etcd as a tree:

    /nodes/1/name → node1
    /nodes/1/ip

The store keeps a global currentIndex; with each change, the index increases by 1, and each change event is associated with the currentIndex at which it occurred.

When a client calls the watch interface (by adding the wait parameter to the request), if the request carries a waitIndex and that waitIndex is less than currentIndex, etcd queries the EventHistory table for an event at or after waitIndex whose key matches the watch; if one exists, it is returned directly. If the history table has no match, or the request carries no waitIndex, the watcher is placed in the WatchHub, where each key is associated with a list of watchers. When a change operation occurs, the resulting event is appended to the EventHistory table and the watchers associated with the key are notified.
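The decision flow above can be sketched in a few lines of Go. All names here (Store, EventHistory, Watch) are illustrative stand-ins for etcd v2's internals, and the history scan is deliberately simplified:

```go
package main

import "fmt"

// Event is a simplified v2 change event.
type Event struct {
	Key   string
	Index uint64
}

// Store sketches the parts of the etcd v2 store relevant to watch:
// a global currentIndex and a bounded event history.
type Store struct {
	CurrentIndex uint64
	EventHistory []Event // bounded to 1000 entries in real etcd v2
}

// Watch mimics the v2 decision: if waitIndex is set and does not exceed
// currentIndex, scan history for a matching event; otherwise the watcher
// would be parked in the WatchHub (represented here by returning nil).
func (s *Store) Watch(key string, waitIndex uint64) *Event {
	if waitIndex != 0 && waitIndex <= s.CurrentIndex {
		for _, e := range s.EventHistory {
			if e.Index >= waitIndex && e.Key == key {
				ev := e
				return &ev // a historical event satisfies the watch immediately
			}
		}
	}
	return nil // no historical match: register in WatchHub and wait
}

func main() {
	s := &Store{CurrentIndex: 5, EventHistory: []Event{{"/a", 3}, {"/b", 4}}}
	fmt.Println(s.Watch("/b", 2)) // finds the historical /b event
	fmt.Println(s.Watch("/c", 2)) // nothing yet: would block in WatchHub
}
```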

A few details here affect usage:

    1. The EventHistory is bounded in length, holding at most 1000 events. In other words, if your client disconnects for a long time and then re-watches, the events at its waitIndex may have been evicted, in which case changes are lost.
    2. If notifying a watcher blocks (each watch channel has a buffer of 100), etcd deletes the watcher directly, which disconnects the wait request; the client must reconnect.
    3. Expiration times are stored in each node of the etcd store and cleaned up by a periodic mechanism.

From this we can see some limitations of etcd v2:

    1. An expiration time can only be set per key; keeping the lifecycles of multiple keys consistent is difficult.
    2. A watch can only cover one key and its children (via the recursive parameter); watching multiple unrelated keys at once is not possible.
    3. It is difficult to achieve complete data synchronization through the watch mechanism alone (changes can be lost), so in most current usage, watch is used to learn that something changed, and the data is then re-fetched with a get rather than relying entirely on watch events.

etcd v3: storage, watch, and expiration

etcd v3 separates watch from store; we first analyze the store implementation.

The etcd v3 store has two parts. One is an in-memory index, kvindex, based on Google's open-source golang btree. The other is backend storage. By design the backend can plug in multiple stores; currently it uses boltdb, a single-machine KV store that supports transactions, on top of which etcd's transactions are built. The key etcd stores in boltdb is the revision, and the value is etcd's own key-value pair; that is, etcd saves every version in boltdb, which is how it implements its multi-version (MVCC) mechanism.

For example: write two records through etcdctl's batch (txn) interface, then update both records through another batch request. boltdb then actually contains four records: two versions of each key.

A revision consists of two parts: the main revision, which increases by one per transaction, and the sub revision, which increases by one per operation within the same transaction. In the example above, the first transaction's main revision is 3 and the second's is 4. Of course, the first problem that comes to mind with this mechanism is space, so etcd provides commands and configuration options to control compaction, and the put operation supports parameters to precisely control the number of historical versions kept for a key.
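A minimal sketch of the revision assignment just described — main advances once per transaction, sub once per operation inside it (the names and helper are illustrative, not etcd's actual code):

```go
package main

import "fmt"

// revision mirrors etcd v3's MVCC revision: main increments once per
// transaction, sub increments per operation within a transaction.
type revision struct {
	main int64
	sub  int64
}

// txn assigns revisions to n operations performed in one transaction,
// given the main revision before the transaction starts.
func txn(currentMain int64, ops int) []revision {
	revs := make([]revision, ops)
	for i := 0; i < ops; i++ {
		revs[i] = revision{main: currentMain + 1, sub: int64(i)}
	}
	return revs
}

func main() {
	// First txn writes two keys: both get main=3, with sub 0 and 1.
	fmt.Println(txn(2, 2)) // [{3 0} {3 1}]
	// Second txn updates both keys: main=4.
	fmt.Println(txn(3, 2)) // [{4 0} {4 1}]
}
```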

Having understood etcd's disk storage, it is clear that reading data from boltdb requires a revision, while clients query by key. So etcd's in-memory kvindex stores the mapping from key to revision, to speed up queries.
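The key-to-revision mapping can be sketched as follows. The real kvindex is a btree suited to range scans; a plain map is used here purely to illustrate the lookup, and all names are illustrative:

```go
package main

import "fmt"

// keyIndex sketches etcd v3's in-memory index: it maps a user key to
// the list of main revisions at which the key was modified.
type keyIndex map[string][]int64

// get returns the largest stored revision <= atRev for key, so a read
// at a given revision can locate the right boltdb record.
func (ki keyIndex) get(key string, atRev int64) (int64, bool) {
	var found int64
	ok := false
	for _, rev := range ki[key] {
		if rev <= atRev && rev > found {
			found, ok = rev, true
		}
	}
	return found, ok
}

func main() {
	ki := keyIndex{"/a": {3, 4}} // /a written at revision 3, updated at 4
	fmt.Println(ki.get("/a", 3)) // a read at rev 3 sees the first version
	fmt.Println(ki.get("/a", 9)) // a read at the latest rev sees rev 4
}
```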

Next we analyze the watch implementation. etcd v3's watch mechanism supports watching a fixed key and also watching a range (range watches can be used to simulate watching a directory structure). So a watchGroup contains two kinds of watchers: key watchers, whose data structure maps each key to a set of watchers, and range watchers, whose data structure is an interval tree (see the link at the end if unfamiliar), which makes it easy to find all watchers whose intervals cover a given key.

Meanwhile, each watchableStore contains two watcherGroups: one synced, one unsynced. The former holds watchers whose data is already in sync and that are waiting for new changes; the latter holds watchers whose data lags behind the current state and that are still catching up.

When etcd receives a watch request from a client, if the request carries a revision parameter, it compares the requested revision with the store's current revision. If the requested revision is greater than the current revision, the watcher is placed in the synced group; otherwise it goes into the unsynced group. etcd also runs a background goroutine that continuously brings unsynced watchers up to date and then migrates them into the synced group. With this design, etcd v3 supports watching from an arbitrary version, without v2's limit of 1000 historical events (assuming, of course, that compaction has not removed the history).
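The synced/unsynced placement rule reduces to a single comparison; a sketch with illustrative names:

```go
package main

import "fmt"

// classify sketches how etcd v3 places a new watcher: a watcher asking
// for a future revision is already in sync, while one asking for an old
// revision must first catch up on historical events from the backend.
func classify(watchRev, currentRev int64) string {
	if watchRev > currentRev {
		return "synced" // wait for new changes only
	}
	return "unsynced" // a background goroutine replays history first
}

func main() {
	fmt.Println(classify(10, 7)) // future revision: synced
	fmt.Println(classify(3, 7))  // old revision: unsynced, must catch up
}
```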

In addition, as mentioned earlier, when etcd v2 notifies a client and the network is bad or the client reads slowly, causing blocking, it closes the current connection directly and the client must re-issue the request. To solve this, etcd v3 maintains a dedicated queue of watchers whose pushes are blocked and retries them in another goroutine.

etcd v3 also improves the expiration mechanism: the expiration time is set on a lease, and keys are then associated with the lease. This lets multiple keys share one lease ID, giving them a uniform expiration time and allowing batch renewal.
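A sketch of the lease model — one TTL shared by many keys. The types below are illustrative; the real mechanism is etcd's Lease service (grant, attach via put with a lease ID, keep-alive renewals):

```go
package main

import "fmt"

// Lease sketches etcd v3's lease: one TTL object that many keys attach
// to, so they expire together and can be renewed in one operation.
type Lease struct {
	ID   int64
	TTL  int64 // seconds until expiry, reset on each renewal
	Keys []string
}

// Attach associates a key with the lease's lifetime.
func (l *Lease) Attach(key string) { l.Keys = append(l.Keys, key) }

// Expire returns every key that disappears when the lease times out.
func (l *Lease) Expire() []string { return l.Keys }

func main() {
	l := &Lease{ID: 1, TTL: 30}
	l.Attach("/svc/a") // multiple keys share one expiration time
	l.Attach("/svc/b")
	fmt.Println(l.Expire()) // both keys go together when the lease lapses
}
```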

Some major changes in etcd v3 compared with etcd v2:

    1. The interface is exposed as RPC over gRPC, and v2's HTTP interface is dropped. The advantage is clearly better efficiency over long-lived connections; the disadvantage is that it is less convenient to use, especially in scenarios where maintaining long-lived connections is awkward.
    2. The original directory structure is dropped in favor of a pure key-value model; users can simulate directories through prefix matching.
    3. Values are no longer kept in memory, so the same amount of memory can support storing far more keys.
    4. The watch mechanism is more reliable; complete data synchronization can basically be achieved through watch alone.
    5. Batch operations and a transaction mechanism are provided; users can implement etcd v2's CAS semantics through batch transaction requests (transactions support If-condition judgments).
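As a sketch of point 5, here is v2-style CAS expressed as a compare-then-put over mod revisions, modeled on a local map. In real client code this would be a clientv3.Txn with a clientv3.Compare(clientv3.ModRevision(key), "=", rev) condition; the types and names below are illustrative only:

```go
package main

import "fmt"

// kv models one etcd v3 key with the revision at which it was last modified.
type kv struct {
	value  string
	modRev int64
}

type store struct {
	rev  int64 // current main revision
	data map[string]*kv
}

// casPut sketches CAS as a v3 transaction:
// If(modRevision(key) == expectedRev) Then(Put(key, val)).
// expectedRev = 0 means "the key must not exist yet".
func (s *store) casPut(key, val string, expectedRev int64) bool {
	curRev := int64(0)
	if cur, ok := s.data[key]; ok {
		curRev = cur.modRev
	}
	if curRev != expectedRev {
		return false // compare failed: the txn takes its Else branch
	}
	s.rev++
	s.data[key] = &kv{value: val, modRev: s.rev}
	return true
}

func main() {
	s := &store{data: map[string]*kv{}}
	fmt.Println(s.casPut("/lock", "me", 0))  // key absent: acquire succeeds
	fmt.Println(s.casPut("/lock", "you", 0)) // stale expectation: fails
}
```

This compare-then-put is exactly the primitive needed for the distributed locks and leader elections mentioned earlier.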

etcd, ZooKeeper, and Consul compared

These three products are often compared during technology selection. etcd and ZooKeeper provide very similar capabilities: both are general-purpose, consistent meta-information stores, both provide a watch mechanism for change notification and distribution, and both are used by distributed systems as shared information stores. Their positions in the software ecosystem are almost identical, and they can substitute for each other. Beyond differences in implementation details, language, and consistency protocol, the biggest difference lies in their surrounding ecosystems. ZooKeeper is an Apache project, written in Java, exposing RPC interfaces; it was originally incubated inside the Hadoop project and is widely used in distributed systems (Hadoop, Solr, Kafka, Mesos, and so on). etcd is an open-source product of CoreOS, comparatively new; with its easy-to-use REST interface and active community it has captured a group of users and is adopted in some newer clusters (such as Kubernetes). Although v3 switched to a binary RPC interface for performance, its ease of use remains better than ZooKeeper's. Consul's goals are more specific: while etcd and ZooKeeper provide distributed consistent storage and leave specific business scenarios — service discovery, configuration change — for users to implement themselves, Consul targets service discovery and configuration change directly, with KV storage included. In a software ecosystem, the more abstract a component, the wider its scope of application, but at the same time the more certain it is to fall short of the needs of specific business scenarios.

etcd's Peripheral Tools

  1. Confd
    In a distributed system, the ideal is for applications to interact with the service-discovery/configuration center directly — that is, to talk to etcd and watch it for service discovery and configuration changes. But we still have many legacy programs whose service discovery and configuration are done by modifying configuration files. etcd positions itself as a general-purpose KV store, so unlike Consul it has no built-in mechanism or tooling for pushing configuration changes into files; Confd is the tool that fills this gap.
    Confd listens for etcd changes through the watch mechanism and synchronizes the data into its own local store. Users configure which keys they care about and provide configuration-file templates. Whenever Confd detects a data change, it renders the templates with the latest data to generate configuration files; if the new file differs from the old, it replaces it and triggers a user-supplied reload script so the application reloads its configuration.
    Confd implements part of what Consul's agent and consul-template do. Its author is Kelsey Hightower of Kubernetes fame, but he seems very busy and has had little time for the project; there has been no release for a long time. We were in a hurry, so we forked a copy to maintain ourselves, mainly adding some new template functions and support for the Metad backend.
  2. Metad
    Service registration is generally implemented in one of two ways: either the scheduling system registers on the application's behalf, or the application registers itself. When the scheduler registers for it, the application needs a mechanism to learn "who am I," and then to discover its own cluster and its own configuration. Metad provides such a mechanism: the client requests a fixed Metad endpoint, /self, and Metad tells the application the meta-information it belongs to, simplifying the client's service-discovery and configuration-change logic.
    Metad does this by keeping a mapping from IP to meta-information path. Its backend currently supports etcd v3, and it exposes a simple, usable HTTP REST interface. It synchronizes etcd data into local memory through the watch mechanism, effectively acting as an etcd proxy. It can therefore also be used as a plain etcd proxy, suitable for scenarios where the etcd v3 RPC interface is inconvenient to use or where you want to reduce load on etcd.
  3. etcd cluster one-click build script
    etcd's official one-click build script has a bug, so I wrote my own script that uses Docker's networking features to build a local etcd cluster with one command, for testing and experimentation.

Precautions for Using etcd

    1. etcd cluster initialization
      If any node is not yet started when the cluster is first initialized, accessing the cluster through the v3 interface reports the error "etcdserver: not capable". For compatibility reasons, the default API version at cluster startup is 2.3; only after all nodes in the cluster have joined and confirmed that they all support the v3 interface is the cluster version promoted to v3. This happens only when the cluster is initialized for the first time. If the cluster is already initialized, a node going down, or even the whole cluster shutting down and restarting, has no such effect (on restart, the cluster API version is loaded from persisted data).
    2. etcd read-request mechanics
      In v2, with quorum=true a read goes through Raft; requests made via the CLI default to true.
      In v3, with --consistency="l" (the default) reads go through Raft; otherwise the node's local data is read. In the SDK this is controlled by whether the WithSerializable option is set.
      With consistent reads, every read also runs a round of the Raft protocol, which guarantees consistency at the cost of performance; under a network partition, the minority side of the cluster cannot serve consistent reads. Without this setting, reads come directly from the local store, losing the consistency guarantee. Be careful to set this parameter according to your scenario, trading off consistency against availability.
    3. etcd's compaction mechanism
      etcd does not compact automatically by default; you must set startup parameters or trigger compaction through the command line. If changes are frequent, configuring compaction is recommended; otherwise space and memory are wasted and errors eventually follow. etcd v3's default backend quota is 2 GB; without compaction, once the boltdb file exceeds this limit, the error "etcdserver: mvcc: database space exceeded" occurs and data can no longer be written.

Brainstorming Time

After the last Elasticsearch article, I set myself a task: each source-code analysis should end with a few divergent, speculative ideas.

    1. A call-analysis tool for concurrent code
      Current IDEs implement code-call analysis with static analysis that traces method invocation chains, which is useful when reading and analyzing code. But if a program fully adopts the CSP or actor model, calls happen through messages and there is no explicit method-call chain, which makes the code harder to read and understand. If a language or IDE could support this kind of message-passing trace analysis, it would be very useful. Of course, this is just a brainstorm; I have not considered the feasibility or complexity of implementing it.
    2. A general-purpose multi-group Raft library
      etcd's current Raft implementation guarantees synchronization among the nodes of one group, but an obvious problem is that adding nodes cannot solve the capacity problem. Solving capacity requires sharding — but after sharding, how do you still use Raft to synchronize data? The answer is multi-group Raft: the replicas of each shard form a virtual Raft group and synchronize through Raft. Multi-group Raft implementations currently exist in TiKV and CockroachDB, but there is not yet an independent, general-purpose library. In theory, with such a library, backing it with a persistent KV gives you distributed KV storage, backing it with an in-memory KV gives you a distributed cache, and backing it with Lucene gives you a distributed search engine. Of course, that is only in theory; actually implementing it is far from trivial.

Lessons from etcd as an Open-Source Product

etcd reinvented a wheel in a space where ZooKeeper's position was already established, and still carved out a place in the ecosystem. On one hand, this shows how communities are changing: communication mechanisms and responsiveness to user feedback matter more and more. On the other hand, it shows that a project's ease of use can sometimes matter even more than stability and functionality. New algorithms and new languages will keep creating opportunities to reinvent wheels.

Q&A from the GitChat Exchange Group

Q: What problems arise when a business upgrades from etcd v2 to v3, and how can the transition be made smooth?

A: Most v2 functionality can be achieved with v3 — for example, using prefixes to simulate the original directory structure and using txn to simulate CAS — so there is generally no functional problem. But because v2 and v3 data are isolated from each other, migration is somewhat cumbersome. I recommend first wrapping a layer in your business code that encapsulates the differences between etcd v2 and v3, and then switching.

Q: How is Metad's watch implemented?

A: Metad's watch implementation is relatively straightforward, because what a Metad watch returns is not a change event but the latest result. So Metad maintains only a single global version number; whenever a client's watch version is less than or equal to the global version, the latest result is returned directly.
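That logic can be sketched as follows (names and the sample payload are illustrative, not Metad's actual code):

```go
package main

import "fmt"

// server sketches the Metad watch described above: instead of an event
// history, it keeps one global version and the latest result.
type server struct {
	version int64
	result  string
}

// watch returns (result, version, true) if the client lags behind the
// global version; otherwise (_, _, false), meaning the request would
// block until the next change bumps the version.
func (s *server) watch(clientVersion int64) (string, int64, bool) {
	if clientVersion <= s.version {
		return s.result, s.version, true
	}
	return "", 0, false
}

func main() {
	srv := &server{version: 7, result: `{"self":{"ip":"10.0.0.2"}}`}
	fmt.Println(srv.watch(5)) // client behind: latest result returned at once
	fmt.Println(srv.watch(8)) // client current: would long-poll for a change
}
```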

Q: etcd and ZK are both components for distributed configuration management, and both provide watch and leader election. As a beginner, how should I choose between the two?

A: etcd and ZK are interchangeable in most cases; both are general-purpose distributed consistent KV stores. Choose whichever is closer to your development stack and more familiar to your team — for example, by language: Go projects pair with etcd, Java projects with ZK, so that reading the source when problems arise is easier. For a new project torn between the two, you can add a lib layer, similar to docker/libkv, that supports both and can be switched when needed.

Q: What are the similarities and differences between etcd, Eureka, and Consul, what are their respective application scenarios, and what are the selection principles? ZK can be included in this question too; they are all of a kind.

A: The etcd-versus-ZK choice was covered above; both position themselves as general-purpose consistent KV stores, while Eureka and Consul position themselves as dedicated to service registration and discovery. The advantage of the former two is, of course, generality: they are widely used, and operationally it is easy to share one deployment among existing services. The disadvantage is exactly that generality: each application's service registration has its own metadata format, so integrating them with each other is troublesome — building a generic API gateway, for example, runs into metadata-format compatibility issues. That becomes the advantage of the latter two. And because their goals are more specific, they can offer more advanced features, such as Consul's DNS support and consul-template tooling, or Eureka's event subscription and filtering mechanism. Eureka itself is an AP system, which means it sacrifices consistency: it holds that in service-discovery and configuration-center scenarios, availability and partition tolerance matter more than consistency. Personally I look forward to these special-purpose solutions: if a service-registration standard could form, applications would interoperate easily. But it is also possible that such a standard will instead emerge as a de facto standard set by a cluster scheduling system.
I do not know the latter two deeply; they could be another article.

Q: Following on from that: what pits might one step into with etcd and ZK, and how deep are they? How do you climb out after falling in?

A: The concept of a "pit" is quite broad; a more detailed treatment would turn into a bug list. But most pits encountered in practice fall into a few categories:

    1. Pits caused by misuse. First understand the positioning of etcd and ZK: they hold a cluster's shared information and cannot be used as general data storage. For example, someone created a huge number of children under one ZK data node and then fetched them all, causing ZK to error out — ZK has a 4 MB buffer limit, and exceeding it fails.
    2. Operational pits. Services like etcd and ZK are generally stable; once set up they need little attention. But if some node does have a problem, recovering the system by adding nodes may run into a missing playbook or lack of operational experience, and a botched recovery can break the cluster.
    3. Design pits around network partitions and availability. When designing a system, be clear about what happens if etcd or ZK goes down entirely, or if a network partition leaves some application nodes connected only to a minority of etcd/ZK nodes (the minority side is unavailable). The correct behavior is for the application to keep serving normally but stop supporting changes, recovering automatically once the etcd/ZK cluster recovers. But with inappropriate design, some automated behaviors can turn this into a major outage.

To step into fewer pits, one way is, as I said in the article, to study the principles — to know not just what but why. The other is to test more, so that when problems occur there is already a playbook.

Q: A few questions about an experimental hardware-cluster project. We built an ARM-based distributed interconnected hardware cluster (following trying-etcd-on-android-mac-and-raspberry-pi to run etcd on ARM development boards), using etcd as a distributed database (with etcd itself running on this hardware), and then, following go-rpio, implemented etcd-based key-value synchronization of hardware information to control certain GPIOs.

Question 1: etcd is known to provide service discovery for other services. In this scenario, suppose five etcd hardware nodes are already running; when a new etcd hardware node is installed, can etcd provide service discovery for itself, automatically discovering and joining new etcd nodes?

Question 2: As more hardware is installed, what is etcd's limit? Will Raft's heartbeat round-trips grow with the number of nodes and make synchronization wait longer?

Question 3: When a network partition is large enough, is data synchronization between the smaller groups of hardware simply impossible?

A: This case is very interesting; let me answer one by one.

    1. etcd exists precisely to do service discovery; if the etcd cluster itself needs service discovery, then you need another etcd cluster :). You can build your own discovery etcd cluster or use etcd's official discovery service; see etcd's official op-guide on clustering.
    2. etcd's mechanism is all-node consistency, so its limits have two parts. One is single-machine capacity: memory and disk. The other is network overhead: every Raft operation requires all nodes to participate, so more nodes means lower performance. Thus there is no point in scaling etcd to many nodes; clusters are generally 3, 5, 7, or 9 nodes, and more rarely makes sense. If you do not care much about consistency, read requests can read node-local data without running the consistency protocol; the way to do this is described earlier in this article.
    3. When etcd partitions, the minority side is unavailable and does not support Raft requests, but it does support non-consistent read requests.

Q: If services are deployed across data centers, do you deploy two etcd clusters? How should a cross-datacenter deployment be set up and configured?

A: It depends on the cross-datacenter scenario. If the two rooms are essentially unrelated and would need a public-network connection, their services generally do not need to share data, right? Deploy two unrelated etcd clusters, one each. But if it is something like AWS availability zones — two rooms with interconnected intranets, with two clusters built so that a room failure can be survived by switching over at any time — etcd currently has no good solution. The recommended approach is to deploy one etcd cluster across availability zones and adjust the heartbeat and election timeouts. With three availability-zone rooms and three nodes per room, losing any one room does not affect the whole cluster — but with only two rooms it is awkward. Another way is to synchronize between two clusters; etcd v3 provides a mirroring tool for this, though it is still imperfect. That said, with etcd's watch mechanism, building such a synchronization tool is not hard. Consul does provide this mechanism: multi-datacenter cluster data synchronization, with the datacenters not affecting each other's availability.

Q: Are there measures that help reduce the herd effect (thundering herd) when using etcd watch?

A: I have run into this too and found no good way other than adding a random delay on the client. (Note: I later discussed this issue with Xiang Li of CoreOS; he said etcd 3.1 will have a solution.)


    1. Raft website: the paper and related resources.
    2. Raft animated demo: watch this animation and you will understand Raft.
    3. Interval tree.
    4. "etcd: an all-round interpretation from application scenarios to implementation principles" — the article that comprehensively describes usage scenarios.
    5. Confd: the repository of our modified fork.
    6. Metad: repository.
    7. etcd cluster one-click build script.
    8. "The pain of concurrency: thread, goroutine, actor" — my article on concurrency models, helpful for understanding the CSP model mentioned above.