Principles of Distributed Systems

Source: Internet
Author: User
I. Key points of distributed systems:
    1. Stateless nodes are exposed externally, while the specific stateful or stateless node logic is implemented internally. A node can provide services or store data.
    2. In a distributed system, the Byzantine problem is addressed in order to keep the service available when some nodes misbehave, rather than to identify the faulty node.
    3. Common exceptions: machine downtime, network exceptions, message loss, out-of-order messages, data errors, and unreliable TCP. A network exception may occur while a message is being received or after processing has completed; the machine may go down after processing completes but before the confirmation message is sent; the request message or the confirmation message may be lost; and a message sent earlier may be received later.
    4. A distributed call has three possible states: success, failure, and timeout. After a timeout, the caller cannot determine whether the request succeeded.
    5. Data is stored on mechanical hard disks, which can fail at any time, so data may not be stored correctly.
    6. Some exceptions cannot be classified, such as processing capacity that swings between high and low, or other strange behavior.
    7. Even a low-probability event happens regularly when the system performs millions of operations or more every day.
    8. Replicas add data redundancy and improve system availability, but maintaining them has a cost. For example, the replicas must be kept consistent, and multiple replicas can diverge from one another.
    9. Consistency levels: strong consistency; monotonic consistency (a reader never sees data older than what it has already read); session consistency (a unified value is read within a session by means of a version); eventual consistency; and weak consistency.


    1. Performance indicators of distributed systems: throughput, response latency, and concurrency. A common unit is QPS, the number of queries processed per second. High throughput and low response latency constrain each other, so pursuing one usually costs some of the other.
    2. Availability indicators: availability is measured by the ratio of service time to non-service time, or by the ratio of successful requests to failed requests.
    3. Scalability indicators: the goal is horizontal scalability, where adding low-configuration machines yields more computing and higher processing capacity.
    4. Consistency indicator: keeping replicas consistent has a cost, so the consistency level must be chosen strictly according to what the business allows.
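
The availability indicators above reduce to simple ratios. The sketch below is only an illustration: it expresses availability as a fraction of total time (an assumption, since the text only speaks of a ratio between service and non-service time), and the function names and the 99.99% target are hypothetical.

```python
def availability(service_seconds: float, outage_seconds: float) -> float:
    """Fraction of total time during which the service was usable."""
    return service_seconds / (service_seconds + outage_seconds)


def request_success_rate(successes: int, failures: int) -> float:
    """Fraction of requests that succeeded."""
    return successes / (successes + failures)


SECONDS_PER_YEAR = 365 * 24 * 3600

# Outage budget per year for a hypothetical 99.99% availability target.
allowed_outage = SECONDS_PER_YEAR * (1 - 0.9999)

print(f"99.99% availability allows about {allowed_outage / 60:.1f} minutes of outage per year")
print(f"success rate: {request_success_rate(999_000, 1_000):.3%}")
```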
II. Principles of distributed systems:

1. Hash distribution maps different values to different machines or nodes. When redundancy is needed, multiple hash values can be mapped to the same place. The simplest implementation takes the remainder of the hash value divided by the number of machines. Expansion is hard with this method: the data is scattered across many machines, and when the machine count changes, most of the data has to be moved from one machine to another. Uneven distribution may also occur.

Common hash keys: an IP address, URL, ID, or other fixed value is hashed, so the same input always produces the same result.
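
To make the expansion problem concrete, here is a minimal sketch of remainder-based hashing. It assumes MD5 as a stable hash and made-up keys of the form `user-N`; the helper names are hypothetical and not taken from any particular system.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic hash: the same key always yields the same integer."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def node_for(key: str, node_count: int) -> int:
    """Map a key to a node index by taking the remainder."""
    return stable_hash(key) % node_count


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose node changes when the cluster is resized."""
    moved = sum(1 for k in keys if node_for(k, old_count) != node_for(k, new_count))
    return moved / len(keys)


keys = [f"user-{i}" for i in range(10_000)]
fraction = moved_fraction(keys, 4, 5)
print(f"keys relocated when growing from 4 to 5 nodes: {fraction:.0%}")
```

With a uniform hash, a key stays put only when its remainder is the same under both node counts, so most keys are relocated when a single machine is added.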

2. Distribution by data range. For example, IDs 1 to 20 are on machine A, IDs 21 to 40 on machine B, IDs 41 to 60 on machine C, and IDs 61 to 100 on machine D. If a node reaches the limit of its processing capacity, its range can be split directly. The metadata that records the data distribution must be maintained, and the node holding it can become a single point of failure. With thousands of machines, each divided into N ranges, the range metadata becomes too large for one machine and may itself require several machines.

The amount of metadata must be strictly controlled to keep metadata storage small.
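
A range lookup of this kind can be sketched with a sorted list of upper bounds. The `RangeTable` class below is a hypothetical illustration that assumes IDs start at 1 and stores one inclusive upper bound per range; it is not the metadata format of any real system.

```python
import bisect


class RangeTable:
    """Range metadata: one inclusive upper bound and one owner per range."""

    def __init__(self, upper_bounds, owners):
        self.upper_bounds = list(upper_bounds)  # must be sorted ascending
        self.owners = list(owners)

    def lookup(self, record_id: int) -> str:
        if record_id < 1 or record_id > self.upper_bounds[-1]:
            raise KeyError(f"id {record_id} is outside every range")
        return self.owners[bisect.bisect_left(self.upper_bounds, record_id)]

    def split(self, split_at: int, new_owner: str) -> None:
        """Split one range; IDs above split_at in that range move to new_owner."""
        index = bisect.bisect_left(self.upper_bounds, split_at)
        if index == len(self.upper_bounds) or self.upper_bounds[index] == split_at:
            raise ValueError("split point must fall strictly inside an existing range")
        self.upper_bounds.insert(index, split_at)
        self.owners.insert(index + 1, new_owner)


table = RangeTable([20, 40, 60, 100], ["A", "B", "C", "D"])
print([table.lookup(i) for i in (1, 20, 21, 60, 61, 100)])

# Machine D is overloaded: hand IDs 81 to 100 to a new machine E.
table.split(80, "E")
print(table.lookup(70), table.lookup(90))
```

Every split adds one metadata entry, which is why the amount of metadata grows with the number of ranges.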

3. Distribution by data volume. Another common method is to distribute data by volume. Unlike hash distribution and range distribution, it is independent of the features of the data itself. The data is instead treated as a sequentially growing file, the file is divided into chunks of a fixed size, and different chunks are placed on different servers. As with distribution by data range, the placement of each chunk must be recorded, and this placement information is managed as metadata by a metadata server.

Because it is independent of the data content, distribution by data volume is generally not prone to data skew: the data is always split evenly and spread across the cluster. When the cluster needs rebalancing, only chunks have to be migrated. Cluster expansion is not restricted either: migrating some chunks to the new machines completes the expansion. The disadvantage is that more complex metadata has to be managed. As with distribution by range, the amount of metadata grows with the cluster size, and efficient metadata management becomes a problem of its own.
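
The following sketch shows the idea of fixed-size chunks plus a placement table. The tiny chunk size, the round-robin placement, and the server names are assumptions made for illustration only.

```python
def split_into_chunks(data: bytes, chunk_size: int):
    """Cut the data into fixed-size chunks; the last chunk may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_chunks(chunk_count: int, servers):
    """Round-robin placement metadata: chunk index -> server name."""
    return {i: servers[i % len(servers)] for i in range(chunk_count)}


def locate(offset: int, chunk_size: int, placement):
    """Find which chunk, and therefore which server, holds a byte offset."""
    chunk_index = offset // chunk_size
    return chunk_index, placement[chunk_index]


CHUNK_SIZE = 4  # deliberately tiny so the example is easy to follow
data = bytes(range(10))
chunks = split_into_chunks(data, CHUNK_SIZE)
placement = place_chunks(len(chunks), ["server-1", "server-2"])

print(placement)
print(locate(9, CHUNK_SIZE, placement))
```

The placement dictionary plays the role of the metadata server's table: it has one entry per chunk, regardless of what the chunk contains.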

4. Consistent hashing: construct a hash ring over the hash value range and divide it into parts, for example three parts that together form the ring. When a machine is added, only the node adjacent to it on the ring is affected, and the new machine takes over part of that node's load. Metadata maintenance is the same as for distribution by data volume. Later expansion can add more nodes in the same way.
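
A hash ring can be sketched with a sorted list of node positions. The version below is a simplified illustration: it assumes MD5 reduced to a 32-bit ring, places each node at a single position (no virtual nodes), and uses hypothetical node and key names.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Place a node name or a key on a ring of 2**32 positions."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16) % (2 ** 32)


class HashRing:
    def __init__(self, nodes=()):
        self._positions = []  # sorted ring positions of the nodes
        self._owners = {}     # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        position = ring_position(node)
        if position in self._owners:
            raise ValueError(f"ring position collision for {node}")
        bisect.insort(self._positions, position)
        self._owners[position] = node

    def node_for(self, key: str) -> str:
        """A key belongs to the first node at or after its position, wrapping around."""
        if not self._positions:
            raise LookupError("ring has no nodes")
        index = bisect.bisect_left(self._positions, ring_position(key))
        if index == len(self._positions):
            index = 0
        return self._owners[self._positions[index]]


keys = [f"key-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved; new owners: {sorted({after[k] for k in moved})}")
```

Every key that changes owner ends up on the new node, which is the property that makes expansion cheaper than with remainder-based hashing.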

5. Create ing metadata and create ing tables.

6. Replicas and data distribution: one piece of data is replicated to multiple servers. For example, the data of application a is stored on three machines, A, B, and C. If one of the three machines fails, requests are handled by the other two. When a machine is added for recovery, the data has to be copied from the other two machines, which increases their load. If there are two applications, a and b, each with three machines, the data of application a can instead be scattered across all six machines, and likewise for application b. The same number of copies is kept, but each application's data is dispersed. If one machine fails, its load is spread evenly over the other five machines, and data recovery reads from five machines, so recovery is fast and the pressure on each server is small. A failed machine can also be replaced without affecting the applications at all.

The principle is that multiple machines hold replicas for one another, which is an ideal way to spread load.

7. The guiding idea of distributed computing is that moving computation is better than moving data. Computation should therefore be scheduled as close as possible to the data it needs, reducing work that crosses process, network, or other boundaries. Moving the data instead can turn the network and the remote machines into bottlenecks.

8. Data distribution methods of common distributed systems: GFS and HDFS distribute by data volume; MapReduce localizes computation according to the GFS data distribution; Bigtable and HBase distribute by data range; PNUTS distributes by hash or by data range; Dynamo and Cassandra use consistent hashing; Mola, Armor, and BigPipe distribute by hash; and Doris combines hash and data volume distribution.

III. Data replica protocols

1. Replicas must meet certain availability and consistency requirements and have a degree of fault tolerance, so that the system keeps providing reliable service even when some problems occur.

2. There are two basic types of replica control protocol: centralized and decentralized.

3. The basic idea of a centralized replica control protocol is that a central node coordinates updates to replica data and maintains consistency among the replicas. Its advantage is that the protocol is relatively simple: all replica-related control is handed to the central node, and concurrency control is performed by the central node, which reduces a distributed concurrency control problem to a single-machine concurrency control problem. Concurrency control means resolving conflicts such as "write-write" and "read-write" when multiple nodes need to modify replica data at the same time. On a single machine, concurrency control is usually implemented with locks, and locking is also a common method for distributed concurrency control. Without a central node to manage locks uniformly, however, a fully distributed lock system is required, which makes the protocol very complex. The disadvantage of a centralized protocol is that system availability depends on the central node. When the central node fails or communication with it is interrupted, the system loses some services (usually at least the update service), so there is a certain amount of service downtime. In other words, the central node is a single point of failure; even if the central node is itself a cluster, it is merely a larger single point of failure.

4. Common problems in replica data synchronization: 1) a network exception prevents a replica from receiving the data; 2) dirty reads: the master node has been updated, but for some reason a replica has not obtained the latest data; 3) a newly added node has not yet obtained data from the master node, so reads from the new node return no data.

5. A decentralized replica control protocol has no central node. All nodes in the protocol are completely equal and reach agreement by negotiating as peers, so a decentralized protocol does not suffer service outages caused by the failure of a central node. Its biggest drawback is that the protocol process is usually complicated; when strong consistency is required, the process becomes especially complex and hard to understand. Because of this complexity, decentralized protocols generally have lower efficiency and performance.

6. Paxos is the only strongly consistent decentralized replica control protocol that has been applied in engineering practice. Chubby and ZooKeeper are applications of this kind of protocol.

ZooKeeper elects its leader using a Paxos-style protocol (its own variant is called ZAB), uses the lease protocol to control whether data is still valid, and uses the quorum protocol to synchronize the leader's data to the followers.

When ZooKeeper performs a quorum write and not every follower is written successfully, the followers that missed the write synchronize the data from the leader afterward to ensure consistency. If the write fails, any follower that already wrote the data is synchronized back to the original data, which amounts to a rollback. If the latest data was written to the leader first, the followers are updated to that latest data.


7. External Store uses the improved row paxos protocol.

8. Dynamo and Cassandra use a decentralized protocol based on consistent hashing. Dynamo uses the quorum mechanism to manage replicas.

9. The lease mechanism is among the most important distributed protocols and is widely used in distributed systems. 1) A lease is generally defined as an agreement in which the issuer grants the holder certain rights for a certain period of time. 2) A lease expresses the issuer's commitment for that period: as long as the lease has not expired, the issuer must strictly honor the commitment it made. 3) The holder may rely on the issuer's commitment during that period, but once the lease expires it must stop relying on it or renew the lease with the issuer. 4) A lease issued by a central server means that, within the lease's validity period, the central server guarantees that the corresponding data value will not be modified. 5) A lease may be invalidated by a version number, by an elapsed duration, or at a fixed point in time.

The principle is the same as that of a cache, such as the browser cache. Because validity depends entirely on the time limit, the mechanism requires the clocks of the issuer and the holder to be synchronized.
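
The cache analogy can be sketched as a client that trusts a value only while its lease is valid. This is a minimal illustration with hypothetical names (`LeaseCache`, `fetch_from_server`); it uses a fake clock so that the expiry is easy to follow, and it ignores renewal and the clock skew between issuer and holder.

```python
class LeaseCache:
    """Client-side cache that serves a value only while its lease is valid."""

    def __init__(self, fetch, lease_seconds: float, clock):
        self._fetch = fetch                  # asks the issuer for the current value
        self._lease_seconds = lease_seconds  # lease duration granted by the issuer
        self._clock = clock                  # returns the current time in seconds
        self._entries = {}                   # key -> (value, expiry time)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]                  # lease still valid: no server round trip
        value = self._fetch(key)             # no lease or lease expired: ask again
        self._entries[key] = (value, now + self._lease_seconds)
        return value


fake_now = [0.0]
fetch_log = []


def fetch_from_server(key):
    fetch_log.append(key)
    return f"value-of-{key}"


cache = LeaseCache(fetch_from_server, lease_seconds=10.0, clock=lambda: fake_now[0])
cache.get("config")   # no lease yet: fetched from the server
fake_now[0] = 5.0
cache.get("config")   # lease valid: served locally
fake_now[0] = 10.0
cache.get("config")   # lease expired: fetched again
print(f"server was contacted {len(fetch_log)} times")
```

In a real system the issuer would also have to wait for outstanding leases to expire before modifying the value, which this client-side sketch does not model.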

10. Heartbeat detection is not reliable. Suppose node Q monitors node A through heartbeats. Q may send a probe while A's response is delayed because A is briefly blocked; Q then concludes that A is down even though A recovers almost immediately. The same wrong conclusion can result from a network disconnection between the two nodes, or from a fault on Q itself. If decisions are based on Q's detection result, multiple primary nodes may exist at the same time.

11. Write-all-read-one (WARO) is the simplest replica control rule. As the name suggests, an update is written to all replicas and counts as successful only when it succeeds on every replica. This keeps all replicas consistent, so a read can be served from any single replica.

12. The quorum protocol relaxes this rule. With N replicas, an update counts as successful once it is written to W replicas, and a read contacts R replicas, where W + R > N. Any set of R replicas then overlaps the replicas that hold the latest successful update, so the replicas that are read must include the latest data.
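
The overlap argument behind quorum reads can be checked by brute force. The sketch below assumes N replicas, a write quorum of W, and a read quorum of R (names chosen here for illustration), and represents each replica as a hypothetical `(version, value)` pair.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when every read set of size r must intersect every write set of size w."""
    return w + r > n


def newest(read_set):
    """Pick the (version, value) pair with the highest version number."""
    return max(read_set, key=lambda pair: pair[0])


def every_read_sees_latest(replicas, r: int) -> bool:
    """Check all possible read sets of size r against the newest version overall."""
    latest = newest(replicas)
    return all(newest(read_set) == latest for read_set in combinations(replicas, r))


# Three replicas; the latest update (version 2) reached W = 2 of them.
replicas = [(2, "new"), (2, "new"), (1, "old")]

print(quorums_overlap(3, 2, 2), every_read_sees_latest(replicas, 2))
print(quorums_overlap(3, 2, 1), every_read_sees_latest(replicas, 1))
print(quorums_overlap(3, 3, 1))  # WARO is the special case W = N, R = 1
```

Reading two replicas always includes a copy of version 2, while reading only one may return the stale copy.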

13. All replicas in the Mola and Armor systems are managed with quorum, that is, an update succeeds once it is written to a majority of replicas.

14. Replica management in BigPipe uses the WARO mechanism.



IV. Log technology

1. Log technology is one of the main techniques for recovering from downtime. It originated in database systems and, strictly speaking, is not a distributed systems technique. In practice, however, distributed systems use logs widely for downtime recovery, and systems such as Bigtable even store the logs themselves in a distributed system, which further improves fault tolerance.

2. Two practical log techniques are the redo log and the No Redo/No Undo log.

3. Database logs come in four types: undo log, redo log, redo/undo log, and No Redo/No Undo log. The four types differ in the points in time at which the log file and the data file must be updated, which leads to different performance and efficiency.

4. The No Undo/No Redo log is a special log technique, also called the "0/1 directory". It keeps two directories, directory 0 and directory 1, plus a master record that records which directory is currently active. For example, the old data is in directory 0 and the new data is in directory 1. Every data access first consults the master record to find the active directory: if directory 0 is active, the data in directory 0 is read; otherwise the data in directory 1 is read. An update is prepared in the inactive directory and takes effect when the master record is switched.
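
The 0/1 directory idea can be sketched with two in-memory dictionaries and an index that plays the role of the master record. The class and method names are hypothetical, and a real implementation would have to make the switch of the master record atomic on disk, which is only assumed here.

```python
class ZeroOneDirectory:
    """Two directories plus a master record that names the active one."""

    def __init__(self, initial: dict):
        self._directories = [dict(initial), {}]
        self._active = 0  # the master record: index of the active directory

    def read(self, key):
        """Reads always go through the master record."""
        return self._directories[self._active][key]

    def stage(self, changes: dict) -> None:
        """Build the new state in the inactive directory; readers are unaffected."""
        staged = dict(self._directories[self._active])
        staged.update(changes)
        self._directories[1 - self._active] = staged

    def switch(self) -> None:
        """Flip the master record; this single step makes the update visible."""
        self._active = 1 - self._active


store = ZeroOneDirectory({"x": 1})
store.stage({"x": 2, "y": 3})
before_switch = store.read("x")  # a crash here would leave the old data intact
store.switch()
after_switch = store.read("x")
print(before_switch, after_switch)
```

Because nothing is visible until the switch, a failure before it needs no undo, and a failure after it needs no redo.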

5. MySQL master/slave replication is also based on logs. The slave database synchronizes with the master simply by replaying the master's logs. Because the slave's replay speed is not tightly bound to the master's update speed, this method can only achieve eventual consistency.

6. On a single machine, transactions are implemented with log technology or with MVCC.

7. The idea of two-phase commit is fairly simple. In the first phase, the coordinator asks all participants whether they can commit the transaction (asks them to vote), and every participant sends its vote to the coordinator. In the second phase, the coordinator decides whether the transaction can be committed globally, based on the votes of all participants, and notifies all participants to carry out the decision. Participants cannot change their vote during the process. A global commit requires every participant to agree to commit; if even one participant votes to abort, the transaction must be aborted. The protocol has no good fault tolerance mechanism for timeouts: a participant that does not know the outcome can neither commit nor abort its local transaction, so the whole process can only block at that point.

8. The fault tolerance of the two-phase commit protocol is poor.

9. The performance of the two-phase commit protocol is also poor. In one successful run, the coordinator needs at least two rounds of interaction with each participant, exchanging four messages: "prepare", "vote-commit", "global-commit", and "confirm global-commit". So many interactions reduce performance. In addition, the coordinator has to wait for the votes of all participants, so a single slow participant slows down the whole process.
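
The voting rule of two-phase commit can be sketched in a few lines. This is an in-process illustration with hypothetical `Participant` and `two_phase_commit` names; it models only the decision rule and leaves out messaging, logging, and the timeout and blocking cases discussed above.

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "vote-commit"
    ABORT = "vote-abort"


class Participant:
    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> Vote:
        """Phase 1: answer the coordinator's prepare request with a vote."""
        self.state = "prepared" if self.can_commit else "aborted"
        return Vote.COMMIT if self.can_commit else Vote.ABORT

    def finish(self, commit: bool) -> None:
        """Phase 2: carry out the coordinator's global decision."""
        self.state = "committed" if commit else "aborted"


def two_phase_commit(participants) -> bool:
    votes = [p.prepare() for p in participants]          # phase 1: collect votes
    decision = all(v is Vote.COMMIT for v in votes)      # commit only if unanimous
    for p in participants:                               # phase 2: broadcast decision
        p.finish(decision)
    return decision


all_agree = [Participant("p1"), Participant("p2")]
one_refuses = [Participant("p1"), Participant("p2", can_commit=False)]

print(two_phase_commit(all_agree), [p.state for p in all_agree])
print(two_phase_commit(one_refuses), [p.state for p in one_refuses])
```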

10. As the name implies, MVCC (multi-version concurrency control) implements concurrency control by keeping multiple versions of the data. The basic idea is that each transaction generates a new version of the data, and a reader selects an appropriate version so that it sees the complete result of a transaction. With MVCC, each transaction works from a valid base version, and transactions can proceed in parallel. In a distributed setting, data of the same version is retrieved from multiple nodes by version number.

11. The MVCC process closely resembles version control systems such as SVN; put another way, version control systems such as SVN apply the MVCC idea.
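
Reading by version can be sketched with a store that never overwrites data. The `MVCCStore` class is a hypothetical single-process illustration: it assigns one increasing version number per commit and does not model write conflicts or garbage collection of old versions.

```python
class MVCCStore:
    """Keeps every committed version so that readers can use a stable snapshot."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit version, value), ascending
        self._latest = 0     # version number of the most recent commit

    def begin(self) -> int:
        """Start a transaction: its snapshot is the latest committed version."""
        return self._latest

    def commit(self, writes: dict) -> int:
        """Publish all writes of a transaction under one new version number."""
        self._latest += 1
        for key, value in writes.items():
            self._versions.setdefault(key, []).append((self._latest, value))
        return self._latest

    def read(self, key, snapshot: int):
        """Return the newest value whose version is not later than the snapshot."""
        for version, value in reversed(self._versions.get(key, [])):
            if version <= snapshot:
                return value
        raise KeyError(key)


store = MVCCStore()
store.commit({"balance": 100})
snapshot = store.begin()        # a reader starts here
store.commit({"balance": 80})   # a later writer does not disturb that reader

print(store.read("balance", snapshot), store.read("balance", store.begin()))
```

The reader that started before the second commit keeps seeing its own version, much like a working copy checked out at an older revision.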

V. CAP theory
    1. The definition of CAP theory is very simple. The three letters stand for three conflicting properties of a distributed system: 1) Consistency: replica consistency; in CAP theory this means strong consistency;

2) Availability: the system can still provide service when an exception occurs;

3) Tolerance to network partition (partition tolerance): the system can tolerate network partitions.


2. CAP theory states that no distributed protocol can fully provide all three properties at the same time, that is: 1) the replicas under the protocol are always strongly consistent; 2) the service is always available; 3) the protocol tolerates any network partition. A distributed system protocol can only trade off among the three.
