General thoughts and a technical summary of distributed systems


I. The difficulties of distributed systems

What are the difficulties in distributed systems compared to single-machine systems?

1. Network factors

Since services and data are spread across different machines, every interaction has to cross machine boundaries, which brings several problems:

1. Network latency: Performance, timeout

Network IO within the same data center is still reasonably fast, but across rooms, and especially across IDCs, network IO becomes a performance bottleneck that cannot be ignored. Moreover, the problem is latency, not bandwidth: bandwidth can always be increased (swapping a gigabit NIC for a 10-gigabit one is only a matter of cost), but latency is bounded by physics and basically cannot be reduced.

The result is that the overall performance of the system drops, which triggers a chain of other problems such as resources being held locked. System calls therefore set timeouts for self-protection, but excessive latency then causes RPC calls to time out, which leads to the real headache: a distributed call has three possible outcomes: success, failure, and timeout. Do not underestimate this third state; it is the source of almost all of the complexity in distributed systems.

There are some corresponding remedies: asynchronous calls and retrying on failure. For the large network cost of distributing data across IDCs, approaches such as data synchronization and proxying over dedicated lines are generally used.
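As a rough illustration of the "three outcomes" point above, here is a minimal sketch of an RPC call wrapped with a per-attempt timeout and bounded retries. The callable `do_rpc`, the timeout and backoff values are all hypothetical; the key observation is that a timeout leaves the real outcome unknown, so blind retries are only safe for idempotent operations.

```python
import socket
import time

def call_with_retry(do_rpc, retries=3, timeout_s=0.5, backoff_s=0.2):
    """Call an RPC stub with a per-attempt timeout and bounded retries.

    `do_rpc` is any caller-supplied callable that performs one network request
    and may raise socket.timeout / ConnectionError. On timeout the real outcome
    is unknown (success, failure, or still in flight), so the operation must be
    idempotent for retries to be safe.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return do_rpc(timeout=timeout_s)      # success
        except socket.timeout as e:               # the third state: outcome unknown
            last_error = e
        except ConnectionError as e:              # definite failure, safe to retry
            last_error = e
        time.sleep(backoff_s * (attempt + 1))     # simple linear backoff
    raise last_error
```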

2. Network failures: packet loss, out-of-order delivery, jitter.

This can be addressed by building the service on a reliable transport protocol such as TCP, but that brings extra network round trips, so it is a trade-off between performance and reliability. This matters even more on the mobile Internet.

2. The CAP theorem: you cannot have both the fish and the bear's paw

The CAP theorem, proposed by Eric Brewer, is one of the most important results in distributed systems:

    1. Consistency: [strong] consistency, transactional guarantees, the ACID model.
    2. Availability: [high] availability, redundancy to avoid single points of failure, at least graceful degradation of service.
    3. Partition tolerance: [high] scalability (partition tolerance): the system is generally required to scale automatically on demand, as HBase does.

The CAP principle tells us that at most two of these three can be satisfied at once; there is no way to have all three. For distributed systems, partition tolerance is a basic requirement, so consistency has to give way. For large web sites, partition tolerance and availability matter most, so the usual choice is to relax consistency appropriately. Against the CAP theorem, NoSQL systems pursue AP while traditional databases pursue CA, which also explains why traditional databases have limited ability to scale out.

Of the three CAP properties, "scalability" is the one unique to distributed systems; such systems exist precisely to use the combined power of a cluster to handle problems a single machine cannot solve. When more performance is needed, one approach is to optimize the system or upgrade the hardware (scale up); the other is to "simply" enlarge the system by adding machines (scale out). A good distributed system always pursues "linear scalability", where performance grows linearly with the size of the cluster.

Availability and scalability usually go together: a highly scalable system is generally also highly available, because there are multiple service (data) nodes rather than a single point. So the problems of distributed systems largely come down to coordinating and balancing consistency against availability and scalability. Stateless systems have no consistency problem; by the CAP principle their availability and partition tolerance are both high, and simply adding machines gives linear scaling. Stateful systems must sacrifice some side of CAP according to the needs and characteristics of the business. Generally speaking, trading systems demand strong consistency and use the ACID model to guarantee it, so their availability and scalability are poor. Most other business systems do not need strong consistency, only eventual consistency; they generally use the BASE model and design around eventual consistency, which lets the system achieve high availability and scalability.

The CAP theorem is thus an important yardstick for measuring distributed systems; another important one is performance.

Consistency model

There are three main types:

    1. Strong consistency: once new data has been written, the new value can be read from any replica at any time. Examples: file systems, RDBMSs, Azure Table.
    2. Weak consistency: different replicas may hold values of different ages, and the application has to do extra work to obtain the latest value. Example: Dynamo.
    3. Eventual consistency: once an update succeeds, the data on all replicas will eventually become consistent.

From these three models we can see that weak and eventual consistency are generally implemented with asynchronous replication, while strong consistency generally requires synchronous (multi-write) replication. Asynchronous usually means better performance but also more complex state control; synchronous means simplicity but also a performance penalty.

There are also other variants:

    1. Causal consistency: if process A notifies process B that it has updated a piece of data, B's subsequent reads see A's most recent write; process C, which has no causal relationship with A, only needs eventual consistency.
    2. Read-your-writes consistency: if process A writes the latest value, A's own subsequent operations will read that latest value, but it may take a while before other users see it.
    3. Session consistency: once a value has been read within a session, no older value will be read in that session.
    4. Monotonic read consistency: once a user has read some value, they will never read a value older than it; other users are not necessarily bound by this.

And so on.

The most important variant is the second, read-your-writes consistency. It is especially suitable for synchronizing data updates: a user's changes are immediately visible to that user, even though other users may still see the old version for a while. Facebook's data synchronization works on this principle.

II. Common techniques in distributed systems and their application scenarios
    • Consistent hashing [with virtual nodes]: data distribution
    • Vector clocks: multi-version data modification
    • Quorum W+R>N [with vector clocks]: the pigeonhole principle, another approach to data consistency
    • Merkle trees [with anti-entropy]: data replication
    • MVCC: copy-on-write and snapshots
    • 2PC/3PC: distributed transactions
    • Paxos: strong consistency protocol
    • Symmetry and decentralization: symmetry simplifies the configuration and maintenance of the system; decentralization is an extension of symmetry that avoids a single master point and makes it easy to scale the cluster out
    • Map-Reduce: divide and conquer; moving computation is cheaper than moving data. Scheduling a computation onto a compute node on the same physical machine as the storage node is called localized computation, an important optimization of computation scheduling
    • Gossip protocol: node (membership) management
    • Lease mechanism
Consistent hashing: solving the problem of evenly distributing data

The hash algorithm we usually use is hash() mod n, but it cannot quickly switch to another node when one of the nodes fails. To get around a single point of failure we could give each node a standby node and switch to it automatically when the node fails, much like a database master/slave pair. However, that still does not solve the problem of rehashing after nodes are added or removed; in other words, nodes cannot be added or deleted dynamically. This is where consistent hashing comes in: all nodes are placed on a hash ring, each request falls somewhere on the ring, and you simply walk clockwise to the first node you meet, which is the node that should serve the request. When a node fails, you just keep walking to the next available node on the ring.

Consistent hashing is most commonly used in distributed caches, such as memcached. Dynamo also uses it as its data distribution algorithm and improves on it with a variant based on virtual nodes: the core idea is to introduce virtual nodes, where each virtual node maps to one physical node and each physical node corresponds to several virtual nodes.

For more information about consistent hashing, you can refer to the author's other blog post: memcached's distributed algorithm learning.

Also worth reading: some problems with consistent hashing in distributed application practice.

Virtual node

As mentioned above, some consistent hashing implementations adopt the idea of virtual nodes. With an ordinary hash function, the positions the servers map to on the ring are distributed very unevenly. The virtual node approach therefore assigns each physical node (server) 100~200 points on the continuum. This suppresses the uneven distribution and minimizes cache redistribution when servers are added or removed.
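A minimal sketch of a consistent hash ring with virtual nodes. The use of MD5, 150 virtual points per physical node, and the node names are illustrative choices, not what memcached or Dynamo actually use:

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=150):
        self._ring = []                             # sorted list of (hash, physical node)
        for node in nodes:
            for i in range(vnodes):
                h = self._hash(f"{node}#{i}")       # one ring point per virtual node
                self._ring.append((h, node))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key):
        """Walk clockwise from the key's position to the first virtual node."""
        h = self._hash(key)
        idx = bisect.bisect_right(self._ring, (h,)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.get_node("user:42"))   # e.g. "cache-b"; when a node is added or removed,
                                  # only the keys near its ring points are remapped
```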

Quorum W+R>N: the pigeonhole principle, another approach to data consistency

N: the number of replicas, i.e. how many copies of the data are stored. R: the minimum number of nodes that must respond for a read to succeed, i.e. how many copies each read must reach. W: the minimum number of nodes that must acknowledge for a write to succeed, i.e. how many copies each write must reach.

So W+R>N means: for a distributed system with N replicas, a write must reach W replicas (W<=N) to count as successful, and a read must reach R replicas (R<=N) to count as successful.

These three numbers determine availability, consistency, and partition tolerance. W+R>N guarantees data consistency (C), and the larger W is, the stronger the consistency. The NWR model hands the CAP trade-off to the user, letting the user weigh functionality, performance, and cost for themselves.

For a distributed system, N is usually at least 3, meaning the same piece of data is stored on three or more different nodes to guard against single points of failure. W is the minimum number of nodes for a successful write, where "successful" can be understood as written synchronously; for example, with N=3 and W=1 a write succeeds as soon as one node has it, and the other two replicas are written asynchronously. R is the minimum number of nodes for a successful read. Why read multiple replicas? Because in a distributed system the data on different nodes may be inconsistent, and by reading several versions from several nodes we can strengthen consistency.

Some NWR settings can produce dirty data and version conflicts, so the vector clock algorithm is usually introduced to resolve them.

The system must ensure that max(N-W+1, N-R+1) nodes are available.
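A toy sketch of the NWR idea, assuming N=3 replicas held in plain Python dicts, with a monotonically increasing version number standing in for the vector clock mentioned above; all names and the choice W=R=2 are illustrative:

```python
import random

N, W, R = 3, 2, 2                       # W + R > N: read and write quorums overlap
replicas = [dict() for _ in range(N)]   # each replica maps key -> (version, value)

def write(key, value, version):
    """A write succeeds once W replicas have acknowledged it; catching the
    remaining replicas up asynchronously is omitted in this sketch."""
    acked = 0
    for rep in random.sample(replicas, N):
        rep[key] = (version, value)
        acked += 1
        if acked >= W:
            return True
    return False

def read(key):
    """Read R replicas and keep the value with the highest version."""
    answers = [rep.get(key) for rep in random.sample(replicas, R)]
    answers = [a for a in answers if a is not None]
    return max(answers, default=None)   # newest (version, value) pair wins

write("user:1", "alice", version=1)
write("user:1", "alicia", version=2)
print(read("user:1"))                   # (2, 'alicia'): because W + R > N, any read
                                        # quorum overlaps the latest write quorum
```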

For the NWR model, it is recommended to read "Transaction processing in distributed systems", which is written in a very accessible way.

Vector clocks: multi-version data modification

See "Transaction processing in distributed systems", which is written in a very accessible way.
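For intuition, here is a tiny sketch of the core vector clock operations: increment on a local update, element-wise merge when reconciling replicas, and a happens-before test whose failure in both directions means the two versions are concurrent. Representing a clock as a plain dict of node -> counter is an illustrative choice:

```python
def increment(clock, node):
    """Local event on `node`: bump its own counter."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(a, b):
    """Element-wise maximum, used when reconciling two replicas."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def happens_before(a, b):
    """True if every counter in `a` is <= its counterpart in `b` (and they differ)."""
    return a != b and all(a.get(n, 0) <= b.get(n, 0) for n in set(a) | set(b))

v1 = increment({}, "A")                                  # {'A': 1}
v2 = increment(v1, "B")                                  # {'A': 1, 'B': 1}
v3 = increment(v1, "C")                                  # {'A': 1, 'C': 1}
print(happens_before(v1, v2))                            # True: v1 causally precedes v2
print(happens_before(v2, v3), happens_before(v3, v2))    # False False: concurrent versions,
                                                         # a conflict the application resolves
```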

Lease mechanism

In Chubby and ZooKeeper, a node that obtains a lease receives a promise from the system: within the validity period, the data, the node's role, and so on are valid and will not change.

Features of the lease mechanism:

    • Issuing a lease only requires one-way network communication, and the same lease can be sent to the receiver repeatedly. Even if the issuer occasionally fails to send a lease, the problem can be solved simply by re-sending it.
    • Machine crashes have little effect on the lease mechanism. If the issuer goes down, it usually cannot change the promise it has already made, so the correctness of the lease is not affected. After the issuer recovers, if it can recover its previous lease information, it can continue to honour its promises; if it cannot, it only needs to wait out one maximum lease timeout so that all leases become invalid, which keeps the lease mechanism intact.
    • The lease mechanism depends on the validity period, which requires the clocks of the issuer and the receiver to be synchronized.
      • If the issuer's clock is slower than the receiver's, the issuer still considers the lease valid when the receiver already thinks it has expired. The receiver can handle this by requesting a new lease before the old one expires.
      • If the issuer's clock is faster than the receiver's, the issuer may consider the lease expired and issue it to another node while the receiver still holds it, breaking the promise and affecting the correctness of the system. To deal with this clock skew, the usual practice is to make the issuer's validity period slightly longer than the receiver's, so that a small amount of skew does not affect the correctness of the lease.

In practice, a lease length on the order of 10 seconds is often chosen; this is a validated empirical value that can serve as a reference, with the exact length chosen to suit the system at hand.
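A minimal sketch of the issuer side of a lease, assuming a single in-process issuer and a local monotonic clock; a real system must also account for the clock skew discussed above, for example by granting the receiver a slightly shorter validity than the issuer records. The class and its behaviour are illustrative, not Chubby's or ZooKeeper's API:

```python
import time

LEASE_SECONDS = 10          # the "10-second level" empirical value mentioned above

class LeaseIssuer:
    def __init__(self):
        self._holder = None
        self._expires_at = 0.0

    def grant(self, node):
        """Grant (or re-send) the primary lease if no other valid lease exists."""
        now = time.monotonic()
        if self._holder in (None, node) or now >= self._expires_at:
            self._holder = node
            self._expires_at = now + LEASE_SECONDS
            return self._expires_at        # promise: `node` is primary until then
        return None                        # someone else still holds a valid lease

issuer = LeaseIssuer()
print(issuer.grant("node-1") is not None)  # True: node-1 becomes primary
print(issuer.grant("node-2") is not None)  # False: node-2 must wait for expiry,
                                           # which is what prevents double-master
```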

Double-master problem (split-brain)

The lease mechanism can solve the "double-master" problem caused by network partitions, i.e. the "split-brain" phenomenon. The configuration center issues a lease to a node, indicating that this node may act as the primary. When the configuration center detects a problem with the primary, it only needs to wait for the previous primary's lease to expire before it can safely issue a new lease to a new primary node, with no risk of ending up with two masters. In a real system, having a single central node act as the configuration center and issue leases is very risky, so actual systems always use several central nodes that replicate each other and form a small cluster; this cluster is highly available and provides the lease-issuing function. Chubby and ZooKeeper are both based on this design.

Chubby generally runs as a cluster of five machines, which can be deployed across two sites and three rooms. The five machines elect a Chubby master via the Paxos protocol; the others are Chubby slaves, and there is only one Chubby master at any time. Chubby's data, such as lock information and client session information, must be synchronized to the whole cluster; this is done semi-synchronously, answering the client once more than half of the machines have acknowledged. This ensures that only a Chubby slave that is fully in sync with the original Chubby master can be elected as the new Chubby master.

Gossip protocol

Gossip is used in peer-to-peer systems so that autonomous nodes can learn about the rest of the cluster (node status, load, and so on). Nodes gossip with each other periodically, and the gossip quickly spreads across the whole system. When nodes A and B gossip, the exchange is roughly: A tells B what gossip it knows; B tells A which of that gossip it has newer versions of; B then sends A the updated gossip. Although the system is described as autonomous, it does contain a few seed nodes. Their role shows up mainly when new nodes join: a new node first gossips with a seed node, so the new node learns about the system and the seed node learns about the new node; other nodes then learn of the newcomer when they periodically gossip with the seed node. During gossip, if a node finds that some node's state has not been updated for a long time, it concludes that that node is down.

Dynamo uses the gossip protocol to do membership and fault detection.
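A toy sketch of one anti-entropy gossip round between two nodes, where each node's view is a map of peer -> (heartbeat counter, last-seen time). The data layout, the 30-second failure threshold, and the node names are illustrative, not Dynamo's actual format:

```python
import time

FAIL_AFTER = 30.0           # seconds without an update => suspect the peer is down

def gossip_round(a_state, b_state):
    """A and B exchange their views; for each peer, the fresher entry wins."""
    for peer in set(a_state) | set(b_state):
        ha, ta = a_state.get(peer, (0, 0.0))
        hb, tb = b_state.get(peer, (0, 0.0))
        best = (ha, ta) if ha >= hb else (hb, tb)   # higher heartbeat counter wins
        a_state[peer] = b_state[peer] = best

def suspect_failures(state, now):
    """Any peer whose entry has gone stale is considered down."""
    return [p for p, (_, seen) in state.items() if now - seen > FAIL_AFTER]

now = time.time()
a = {"a": (5, now), "c": (2, now - 60)}   # a has stale news about c
b = {"b": (7, now), "c": (9, now - 1)}    # b heard from c recently
gossip_round(a, b)
print(a == b)                              # True: the two views converge in one round
print(suspect_failures(a, now))            # []: c's fresher heartbeat clears the suspicion
```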

2PC, 3PC, Paxos: protocols for distributed transactions

Distributed transactions are hard to do well, so unless they are truly necessary, eventual consistency is generally used to sidestep them.

At present, among NoSQL storage systems, only Google's underlying storage implements distributed transactions: Megastore, a system developed in Java on top of BigTable, implements two-phase locking and uses Chubby to avoid the problem of the two-phase-lock coordinator going down. At the time of writing there was only a brief introduction to the Megastore implementation and no related paper.

2PC

Simple to implement but inefficient: all participants block, so throughput is low, and there is no fault tolerance, since a single node failure aborts the whole transaction. If a participant finishes the first phase but never receives the decision in the second phase, the data node is left "at a loss", and that state blocks the entire transaction.
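A minimal single-process sketch of the 2PC message flow (prepare, then commit or abort), with participants simulated as in-memory objects; all class and variable names are made up for illustration. The blocking problem shows up as the `prepared` flag, which a participant cannot resolve on its own if the coordinator's decision never arrives:

```python
class Participant:
    def __init__(self, name, will_succeed=True):
        self.name = name
        self.will_succeed = will_succeed
        self.prepared = False        # resources locked, waiting for the decision
        self.committed = None

    def prepare(self):
        self.prepared = self.will_succeed    # vote yes/no; a yes vote locks resources
        return self.prepared

    def decide(self, commit):
        self.committed = commit and self.prepared
        self.prepared = False                # locks released once the decision arrives

def two_phase_commit(participants):
    # Phase 1: ask every participant to prepare.
    votes = [p.prepare() for p in participants]
    all_yes = all(votes)
    # Phase 2: broadcast the decision. If the coordinator crashes here, every
    # prepared participant is stuck holding its locks: the blocking problem.
    for p in participants:
        p.decide(all_yes)
    return all_yes

ps = [Participant("db1"), Participant("db2", will_succeed=False)]
print(two_phase_commit(ps))          # False: a single "no" vote aborts everyone
```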

3PC

An improved version of 2PC that splits 2PC's first phase in two: first ask, then lock resources, and only then really commit. The core idea of 3PC is that resources are not locked during the ask phase; they are locked only after everyone has agreed.

The advantage of 3PC over 2PC is that if a node is in the PreCommit state when a fail/timeout occurs, 3PC can still move the state forward to Commit, whereas 2PC is at a loss.

However, 3PC is difficult to implement and cannot handle network partitions. If the two rooms are disconnected right after the PreCommit messages are sent, the participants in the room where the coordinator sits will abort while the remaining participants will commit.

Paxos

The purpose of Paxos is to get the nodes of an entire cluster to agree on a change to a value. Paxos is a consensus algorithm based on message passing, and it works essentially like a democratic election: the decision of the majority becomes the unified decision of the whole cluster.

Any node can propose a change to a given piece of data; whether the proposal is adopted depends on whether more than half of the nodes in the cluster agree (which is why Paxos clusters usually have an odd number of nodes). This is the biggest difference between Paxos and 2PC/3PC: in a cluster of 2F+1 nodes, up to F nodes may be unavailable.

Besides guaranteeing consistency for data changes, Paxos's distributed, majority-vote style of election is also often used for single-point failover, such as master election.

The Paxos protocol is notoriously difficult, both to understand and to implement :(
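For intuition only, here is a heavily simplified single-decree (Basic Paxos) sketch with in-memory acceptors, no message loss, and no duelling proposers; the class and function names are illustrative, and none of this reflects a production implementation. It does show the two phases and the rule that a later proposal must adopt any value already accepted by a majority:

```python
class Acceptor:
    def __init__(self):
        self.promised = -1              # highest proposal number promised so far
        self.accepted = (-1, None)      # (number, value) most recently accepted

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return True, self.accepted  # promise, reporting any earlier accept
        return False, None

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, value):
    """One proposer round; returns the chosen value, or None if no majority."""
    majority = len(acceptors) // 2 + 1
    # Phase 1 (prepare): adopt the value of the highest-numbered prior accept, if any.
    granted = [acc for ok, acc in (a.prepare(n) for a in acceptors) if ok]
    if len(granted) < majority:
        return None
    _, prior_value = max(granted)
    if prior_value is not None:
        value = prior_value
    # Phase 2 (accept): the value is chosen once a majority accepts it.
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None

cluster = [Acceptor() for _ in range(5)]                  # 2F+1 = 5 tolerates F = 2
print(propose(cluster, n=1, value="node-A is master"))    # node-A is master
print(propose(cluster, n=2, value="node-B is master"))    # still node-A is master:
                                                          # later proposals must adopt
                                                          # the value already chosen
```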

For 2PC, 3PC, and Paxos, "Transaction processing in distributed systems" is highly recommended reading.

At present, most payment systems are in practice their own refinements of 2PC; in general, an error handler is introduced to coordinate error handling (rollback or failure recovery).

MVCC: Multi-version concurrency control

This is an important mechanism by which many RDBMS storage engines support highly concurrent modification. For details, refer to the links below; a small sketch of the idea follows them.

    1. The application of multi-version concurrency control (MVCC) in distributed systems
    2. MVCC (Oracle, InnoDB, Postgres).pdf
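As promised above, a tiny sketch of the copy-on-write / snapshot idea behind MVCC: writers append new versions tagged with a transaction id, and each reader sees only versions no newer than its own snapshot id, so readers never block writers. The structure is purely illustrative and skips commit handling entirely (writes are treated as committed immediately):

```python
class MVCCStore:
    def __init__(self):
        self._versions = {}      # key -> list of (txn_id, value), in write order
        self._next_txn = 1

    def begin(self):
        """Start a transaction; its id doubles as the snapshot it reads from."""
        txn = self._next_txn
        self._next_txn += 1
        return txn

    def write(self, txn, key, value):
        """Copy-on-write: append a new version instead of overwriting in place."""
        self._versions.setdefault(key, []).append((txn, value))

    def read(self, txn, key):
        """Snapshot read: the newest version written by a txn id <= our own."""
        visible = [(t, v) for t, v in self._versions.get(key, []) if t <= txn]
        return max(visible)[1] if visible else None

store = MVCCStore()
t1 = store.begin()
store.write(t1, "k", "v1")
t2 = store.begin()                 # snapshot taken before the next write
t3 = store.begin()
store.write(t3, "k", "v3")
print(store.read(t2, "k"))         # v1: t2 never sees the later version
print(store.read(t3, "k"))         # v3
```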
Map-Reduce: 1. divide and conquer; 2. moving computation is cheaper than moving data

If the compute node and the storage node sit on different physical machines, the data to be processed must travel over the network, which is expensive. The alternative is to schedule the computation onto a compute node on the same physical machine as the storage node; this is called localized computation, and it is an important optimization in computation scheduling.
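A minimal in-process sketch of the map / shuffle / reduce flow (word count), ignoring the distribution across machines and the locality scheduling described above; the function names and sample data are illustrative:

```python
from collections import defaultdict

def map_phase(document):
    """map: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in document.split()]

def shuffle(mapped):
    """shuffle: group all intermediate pairs by key."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """reduce: fold the values for one key; here, a sum."""
    return key, sum(values)

splits = ["the quick brown fox", "the lazy dog the end"]
intermediate = [pair for doc in splits for pair in map_phase(doc)]
result = dict(reduce_phase(k, vs) for k, vs in shuffle(intermediate).items())
print(result["the"])    # 3
```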

Classic papers and distributed systems worth studying: Dynamo, HBase

LSM tree
    • The LSM (Log-Structured Merge) tree is an improvement on the B+ tree
    • Some read performance is sacrificed in exchange for a large gain in write performance
    • Idea: split the tree (see the sketch after this list)
      • First write the WAL, then record the data in memory, building an ordered subtree (the memstore)
      • As the subtree grows, the in-memory subtree is flushed to disk (a StoreFile)
      • Reading data: all ordered subtrees must be checked (it is not known which subtree holds the data)
      • Compaction: a background thread merges the on-disk subtrees into one tree (because reading many subtrees is slow)
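A toy sketch of the memstore / flush / compact lifecycle from the list above. WAL handling and the on-disk format are omitted, everything stays in memory, and the class name and memstore limit are illustrative:

```python
class TinyLSM:
    def __init__(self, memstore_limit=4):
        self.memstore = {}          # in-memory "ordered subtree" (a dict, for brevity)
        self.segments = []          # flushed, immutable "StoreFiles", newest last
        self.memstore_limit = memstore_limit

    def put(self, key, value):
        # A real system appends to the WAL first; omitted here.
        self.memstore[key] = value
        if len(self.memstore) >= self.memstore_limit:
            self.segments.append(dict(sorted(self.memstore.items())))  # flush
            self.memstore = {}

    def get(self, key):
        """Reads check the memstore, then every segment from newest to oldest."""
        if key in self.memstore:
            return self.memstore[key]
        for segment in reversed(self.segments):
            if key in segment:
                return segment[key]
        return None

    def compact(self):
        """Background merge: fold all segments into one; newer values win."""
        merged = {}
        for segment in self.segments:        # oldest first, so newer overwrites older
            merged.update(segment)
        self.segments = [merged]

db = TinyLSM()
for i in range(10):
    db.put(f"k{i}", i)
db.compact()
print(db.get("k3"), len(db.segments))        # 3 1
```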

In fact, Lucene's indexing mechanism is similar to HBase's LSM tree: data is also written as separate segments, and the segments are merged in the background.

Reference documents
    1. A ramble on NoSQL
    2. Data distribution design across multiple IDCs (Part 1)
    3. Transaction processing in distributed systems
    4. Mass storage series (4): single-machine transaction processing
    5. A collection of my technical shares
    6. Learning from Google Megastore (Part 1)

Http://blog.arganzheng.me/posts/thinking-in-distributed-systems.html

