Transferred from: http://blog.jqian.net/post/dynamo.html

Dynamo is a highly available distributed key-value system developed by Amazon, with a proven track record as back-end storage for the Amazon store. Its hallmarks: it is always writable (serving 99.9% of requests within 300ms), and it can be tuned per application by adjusting the R, W, N model.
According to the CAP theorem (Consistency, Availability, Partition tolerance), Dynamo is an AP system: it guarantees only eventual consistency.
Three main concepts of Dynamo:
- Key-value: the key uniquely identifies a data object and the value holds its content; an object can be read and written only through its key.
- Node: a physical host. Each node has three main functional components: request coordination, membership and failure detection, and a local persistence engine. The underlying persistent store is typically Berkeley DB Transactional Data Store (TDS).
- Instance: each instance consists of a set of nodes and presents the read/write interface to the application layer. The nodes of an instance can sit in different IDCs for disaster tolerance.
Data Partitioning (Partition)
Data partitioning is a central topic in distributed systems. Dynamo uses a variant of consistent hashing that adds virtual nodes: one physical node is mapped to hundreds of virtual nodes spread over the ring (a minimal sketch follows the list below). The benefits of this are:
- If a node becomes unavailable (failure or maintenance), its load is spread evenly across the remaining available nodes;
- When a node comes back, or a new node joins, it picks up a roughly equal amount of load from each of the other nodes;
- The number of virtual nodes can be set per machine according to its capacity, so machines of differing capability each receive a proportional share of the load.
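A minimal sketch of consistent hashing with virtual nodes (class and parameter names are illustrative, not from Dynamo's code):

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Map a string to a position on the ring (MD5 gives a 128-bit space)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, vnodes_per_node: int = 100):
        self.vnodes_per_node = vnodes_per_node
        self.positions = []   # sorted ring positions of all virtual nodes
        self.owners = {}      # ring position -> physical node

    def add_node(self, node: str) -> None:
        """Scatter one physical node over many virtual nodes on the ring."""
        for i in range(self.vnodes_per_node):
            pos = ring_hash(f"{node}#vnode{i}")
            bisect.insort(self.positions, pos)
            self.owners[pos] = node

    def get_node(self, key: str) -> str:
        """Walk clockwise from the key's position to the first virtual node."""
        idx = bisect.bisect(self.positions, ring_hash(key)) % len(self.positions)
        return self.owners[self.positions[idx]]

ring = ConsistentHashRing()
for n in ("A", "B", "C"):
    ring.add_node(n)
print(ring.get_node("cart:alice"))  # e.g. "B"
```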
Data Replication (Replication)
For high availability, Dynamo also keeps replicas, three by default. Dynamo's replication is simple: when a key hashes to node A on the ring, node A acts as the coordinator and copies the data to the N-1 nodes adjacent to it in the clockwise direction, where N is the number of replicas.
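Building on the ring sketch above, the coordinator's N replica targets (the "preference list") can be derived by continuing clockwise and skipping virtual nodes that belong to an already chosen physical host. This is an illustrative helper, not Dynamo's actual code:

```python
def preference_list(ring: ConsistentHashRing, key: str, n: int = 3) -> list:
    """The first N distinct physical nodes clockwise from the key's position."""
    start = bisect.bisect(ring.positions, ring_hash(key))
    nodes = []
    for i in range(len(ring.positions)):
        pos = ring.positions[(start + i) % len(ring.positions)]
        owner = ring.owners[pos]
        if owner not in nodes:   # skip extra virtual nodes of a chosen host
            nodes.append(owner)
        if len(nodes) == n:
            break
    return nodes                 # nodes[0] is the coordinator

print(preference_list(ring, "cart:alice"))  # e.g. ["B", "C", "A"]
```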
Data version (Versioning)
Because there are multiple replicas, Dynamo accepts writes on any replica before eventual consistency is reached, tagging each write with a version; as a result, several versions of the same data object can exist in the system at once. This approach suits Amazon's own shopping-cart application well, since every change a user makes to the cart is preserved.
In most cases a new version subsumes the older ones, and the system can determine the final version by itself (syntactic reconciliation). But as anyone who has used a version control system knows, conflicts are unavoidable; Dynamo meets them too, and then the decision is handed to the application layer, which merges (collapses) the divergent branches into a single version. For the shopping cart, this reconciliation means added items are never lost, though deleted items may reappear, which is acceptable for that scenario.
Dynamo uses vector clocks for this version tracking and conflict detection.
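A minimal vector-clock sketch showing how Dynamo-style syntactic reconciliation can decide between versions (class and method names are illustrative):

```python
class VectorClock:
    """One counter per coordinating node; clock A descends from clock B
    iff every counter in B is <= the matching counter in A."""
    def __init__(self, counters=None):
        self.counters = dict(counters or {})

    def bump(self, node: str) -> "VectorClock":
        c = dict(self.counters)
        c[node] = c.get(node, 0) + 1
        return VectorClock(c)

    def descends(self, other: "VectorClock") -> bool:
        return all(self.counters.get(n, 0) >= v for n, v in other.counters.items())

def reconcile(a: VectorClock, b: VectorClock):
    if a.descends(b):
        return a      # a subsumes b: syntactic reconciliation succeeds
    if b.descends(a):
        return b
    return None       # concurrent branches: the application must merge them

base = VectorClock().bump("A")      # written via coordinator A
v1 = base.bump("A")                 # next write again via A
v2 = base.bump("B")                 # concurrent write via B
print(reconcile(v1, base) is v1)    # True: v1 subsumes base
print(reconcile(v1, v2))            # None: conflict for the application
```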
Read and write operations
Dynamo is a highly available system: any node can accept read and write requests from the application layer at any time (in the failure-free case). But because there are multiple replicas, reads and writes raise data consistency questions; to address them, Dynamo uses a quorum-like consistency protocol.
The quorum protocol has two configuration items:
- R: the minimum number of nodes that must take part in a successful read operation
- W: the minimum number of nodes that must take part in a successful write operation
The quorum condition is W + R > N: the number of replicas required for a successful write plus the number required for a successful read must exceed the total number of replicas, so every read set overlaps every write set and eventual consistency can be guaranteed. The official recommendation is (N, R, W) = (3, 2, 2) as a balance between availability and consistency.
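A toy illustration of the overlap argument (pure arithmetic, not Dynamo code):

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """R + W > N forces every read set of R replicas to intersect every
    write set of W replicas, so a read always reaches at least one replica
    holding the latest acknowledged write."""
    return r + w > n

assert quorums_overlap(3, 2, 2)        # the recommended (N, R, W) = (3, 2, 2)
assert not quorums_overlap(3, 1, 1)    # fast, but reads can miss the newest write
```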
Handling temporary faults (hinted handoff)
When a node fails temporarily, writes destined for it are routed to the next node on the list and marked as hinted data; once the original node is known to have recovered, the stand-in pushes the data back to it. This greatly improves the system's write availability.
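A sketch of a sloppy-quorum write with hinted handoff; `is_up`, `put`, and `put_hinted` are hypothetical node methods, not a real API:

```python
def put_with_hints(pref_list, fallbacks, key, value, w: int) -> bool:
    """Write to the N preference-list nodes; for each one that is down,
    write to the next healthy node instead, tagged with a hint naming the
    intended owner. Hinted replicas count toward W, so temporary failures
    do not block writes."""
    acks = 0
    spare = iter(fallbacks)                  # healthy nodes beyond the top N
    for owner in pref_list:
        if owner.is_up():
            owner.put(key, value)
        else:
            stand_in = next(spare)
            # kept in a separate local table; pushed back once owner recovers
            stand_in.put_hinted(key, value, hint=owner)
        acks += 1
    return acks >= w
```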
Handling permanent faults (Replica synchronization)
To detect inconsistencies between replicas faster, Dynamo uses Merkle trees. A Merkle tree is a tree of hashes: each leaf is the hash of a key's value, and each interior node is the hash of its children, so any change in a leaf propagates up to its ancestors. Indexing data with a Merkle tree means any change is surfaced quickly, speeding up the search for divergent data: replicas compare roots first and descend only into the sub-trees whose hashes differ. The same technique is widespread in peer-to-peer transfer systems such as BitTorrent.
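A minimal Merkle-root sketch; a real implementation also keeps the interior levels so two replicas can recurse only into the sub-trees whose hashes differ:

```python
import hashlib

def digest(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def merkle_root(leaf_values: list) -> bytes:
    """Hash the leaves, then hash pairs of children upward until one root remains."""
    level = [digest(v) for v in leaf_values]
    while len(level) > 1:
        if len(level) % 2:                 # odd level: carry the last hash up
            level.append(level[-1])
        level = [digest(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Equal roots => the key ranges agree; unequal roots => recurse into sub-trees.
print(merkle_root([b"v1", b"v2"]) == merkle_root([b"v1", b"v2"]))   # True
print(merkle_root([b"v1", b"v2"]) == merkle_root([b"v1", b"v3"]))   # False
```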
Membership and failure detection
Gossip is a decentralized communication protocol, often used in distributed systems without strong consistency to synchronize node state. Concretely, within a known set of nodes, each node periodically starts a gossip exchange with a randomly chosen peer; after several rounds all nodes converge on a consistent view. Gossip can be used both for membership discovery and for failure detection.
Gossip has a number of concrete implementations; Dynamo uses an anti-entropy implementation.
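A toy anti-entropy round; each node's `view` maps a member name to a (status, version) pair, with the higher version winning on merge (all names are illustrative assumptions, not Dynamo's actual state format):

```python
import random

def gossip_round(nodes) -> None:
    """Every node exchanges state with one random peer; entries with the
    higher version number win, so views converge over repeated rounds."""
    for node in nodes:
        peer = random.choice([n for n in nodes if n is not node])
        merged = {}
        for member in node.view.keys() | peer.view.keys():
            a = node.view.get(member, (None, -1))
            b = peer.view.get(member, (None, -1))
            merged[member] = a if a[1] >= b[1] else b   # newer version wins
        node.view = dict(merged)
        peer.view = dict(merged)
```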
Reportedly, early Dynamo took an approach similar to Corosync, with each node maintaining a global view of the state of all nodes.