First, key points of distributed systems:
- Expose stateless nodes externally; internally, the node logic may be stateful or stateless. A node may provide services, store data, or both.
- The Byzantine problem, as applied to distributed systems, is about keeping the service available despite faulty nodes, not merely about identifying which node is wrong.
- Common anomalies: machine crashes, network failures, message loss, message reordering, data corruption, and unreliable TCP. A machine may crash before a message is processed, or after processing completes but before the confirmation is sent; the outgoing message may be lost, or the confirmation may be lost; and a message sent first may be received later than one sent after it.
- A distributed operation has three possible outcomes: success, failure, and timeout. In the timeout case, it is impossible to tell whether the operation succeeded, for the reasons above.
- Data is stored on mechanical hard disks, where an exception can occur at any time, so data may not be stored correctly.
- Hard-to-classify anomalies, such as a system's processing capacity fluctuating unpredictably between high and low.
- With millions or tens of millions of operations per day, even a low-probability event becomes a near-certainty.
- Replicas add redundancy to the data and increase the availability of the system, but the benefit comes with the cost of maintaining the replicas, chiefly replica consistency: multiple replicas may stay consistent or may diverge.
- Consistency levels: strong consistency; monotonic consistency (a reader never sees data older than what it has already read); session consistency (within one session, reads see a single consistent version); eventual consistency; weak consistency.
- Distributed system performance metrics: throughput, response latency, and concurrency. A common unit is QPS, the number of queries processed per second. High throughput often comes at the cost of higher response latency; the two constrain each other.
- Availability metrics: availability can be measured as the ratio of service time to total (service plus non-service) time, or as the ratio of successful requests to total requests.
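For example, "four nines" availability (99.99%) allows at most about 53 minutes of downtime per year: a year has 525,600 minutes, and 0.01% of that is roughly 52.6 minutes.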
- Scalability metrics: the ability to scale horizontally; adding more low-cost machines yields greater computation and higher processing power.
- Consistency metrics: achieving consistency between replicas; the required consistency level must be weighed strictly against what the business can tolerate.
Second, principles of distributed systems:
1. Hash distribution: hash the data and map different hash values to different machines or nodes. When redundancy is needed, multiple hash values can map to the same place. A common implementation takes the hash modulo the number of machines. Scaling this scheme is difficult: the data is scattered across many machines, and an expansion must pull data back from all of them; it is also prone to skew (uneven distribution).
A common choice of hash key is an IP, URL, ID, or other fixed value, so the same input always yields the same result.
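As a rough sketch (the function name and key format below are invented, not from any particular system), hash-modulo routing and its scaling pain fit in a few lines of Python:

```python
import hashlib

def machine_for(key: str, num_machines: int) -> int:
    """Map a key to a machine index: hash, then take the remainder."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_machines

# The same key always maps to the same machine...
print(machine_for("user:42", 4))
# ...but changing the machine count remaps almost every key,
# which is why scaling out forces a large data migration.
print(machine_for("user:42", 5))
```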
2. Distribution by data range: for example, IDs 1-20 on machine A, 21-40 on machine B, 41-60 on machine C, and 61-100 on machine D. Data distribution is fairly uniform, and if one node's processing power is insufficient, its range can simply be split. Maintaining the metadata that describes the distribution can become a single-point bottleneck: with thousands of machines, each divided into many ranges, the metadata grows so large that several machines may be needed just to manage it.
Be sure to strictly control the amount of metadata, reducing metadata storage wherever possible.
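A minimal sketch of range-based routing, with made-up boundaries matching the example above; the routing table is exactly the metadata the text warns about:

```python
import bisect

RANGE_UPPER_BOUNDS = [20, 40, 60, 100]   # inclusive upper bound of each range
MACHINES = ["A", "B", "C", "D"]          # this pair of lists is the "metadata"

def machine_for(record_id: int) -> str:
    """Route an ID to the machine owning the range that contains it."""
    idx = bisect.bisect_left(RANGE_UPPER_BOUNDS, record_id)
    return MACHINES[idx]

print(machine_for(15))   # -> A  (range 1..20)
print(machine_for(41))   # -> C  (range 41..60)
# Splitting a hot node is just inserting a new boundary into the table.
```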
3. Distribution by data volume: another commonly used method distributes data by its volume. Unlike hashing and range distribution, it is independent of the specific characteristics of the data: the data is treated as a sequentially growing file, divided into chunks of fixed size, and the chunks are spread across different servers. Like range distribution, it requires recording where each chunk lives and managing that placement as metadata on a metadata server.
Because placement is independent of the data's content, distribution by volume is generally not prone to skew: the data is always split evenly and scattered across the cluster. Rebalancing the cluster simply means migrating chunks, and expansion has no hard limit; migrate some chunks onto the newly added machines. The disadvantage is the need to manage more complex metadata: as with range distribution, when the cluster grows large the metadata volume becomes huge, and managing it efficiently becomes a problem in its own right.
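A minimal sketch of chunk placement, with an assumed 64 MB chunk size and invented names (the dict stands in for a real metadata server):

```python
CHUNK_SIZE = 64 * 1024 * 1024            # 64 MB per chunk (illustrative)
chunk_locations: dict[int, str] = {}     # chunk index -> server: the metadata

def chunk_for(offset: int) -> int:
    """Which chunk a byte offset falls into, independent of data content."""
    return offset // CHUNK_SIZE

def place(chunk: int, servers: list[str]) -> None:
    """Record a chunk's location; rebalancing just rewrites entries here."""
    chunk_locations[chunk] = servers[chunk % len(servers)]

for c in range(4):
    place(c, ["s1", "s2", "s3"])
print(chunk_for(100 * 1024 * 1024))      # offset 100 MB falls in chunk 1
print(chunk_locations)                   # {0: 's1', 1: 's2', 2: 's3', 3: 's1'}
```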
4. Consistent hashing: construct a hash ring. For example, take the hash space [0, 10) and divide it into three segments: [1, 4), [4, 9), and [9, 10) joined with [0, 1), since the space wraps around into a ring. When a machine is added, only its adjacent nodes change: the new node takes over part of its neighbors' load. The metadata is maintained in much the same way as with distribution by data volume. For later expansion, each physical machine can own multiple virtual nodes on the ring.
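A minimal consistent-hash ring with virtual nodes (all names invented); adding a node later moves only the keys in the ring segments it takes over:

```python
import bisect
import hashlib

class Ring:
    """A consistent-hash ring; each machine owns several virtual nodes."""

    def __init__(self, vnodes: int = 100) -> None:
        self.vnodes = vnodes
        self.points: list[int] = []        # sorted hash positions on the ring
        self.owner: dict[int, str] = {}    # hash position -> machine

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add(self, machine: str) -> None:
        # Each virtual node takes over only a slice of its neighbors' range.
        for i in range(self.vnodes):
            h = self._hash(f"{machine}#{i}")
            bisect.insort(self.points, h)
            self.owner[h] = machine

    def locate(self, key: str) -> str:
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self.points, self._hash(key)) % len(self.points)
        return self.owner[self.points[idx]]

ring = Ring()
for m in ("A", "B", "C"):
    ring.add(m)
print(ring.locate("user:42"))   # adding "D" later moves only ~1/4 of the keys
```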
5. Distribution by mapping table: build the mapping metadata explicitly, i.e., maintain a lookup table from data to machine.
6. Replicas and data distribution: spread the replicas of each piece of data across multiple servers. Suppose application A's data is stored on machines 1, 2, and 3. If one of the three fails, its requests fall on the other two, and when it recovers it must copy its data back from those two, adding to their burden. If instead applications A and B each scatter their data across all six machines, the same degree of replication is achieved while each application's data is dispersed: when one machine is damaged, its load is spread evenly over the other five, recovery pulls data from five machines so it is fast and puts little pressure on any single server, and a damaged machine can even be replaced without affecting the applications at all.
The principle is that many machines act as replicas of one another, which is an ideal way to spread load.
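A toy illustration under an assumed round-robin placement policy: with replicas scattered, the surviving copies of a failed machine's data sit on several machines, so the recovery load is shared:

```python
MACHINES = ["m1", "m2", "m3", "m4", "m5", "m6"]
REPLICAS = 3

# Place 3 replicas of each chunk round-robin across all 6 machines.
placement = {
    chunk: [MACHINES[(chunk + r) % len(MACHINES)] for r in range(REPLICAS)]
    for chunk in range(12)
}

# When m3 fails, the other replicas of its chunks live on several machines,
# so recovery traffic is shared instead of hammering just two servers.
failed = "m3"
helpers = {m
           for replicas in placement.values() if failed in replicas
           for m in replicas if m != failed}
print(sorted(helpers))   # ['m1', 'm2', 'm4', 'm5']
```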
7. The distributed computing principle: moving computation beats moving data. Perform computation as close as possible to the resources it needs, minimizing cross-process, cross-network, and other long-span operations, because the network and remote machines can become bottlenecks.
8. Data distribution modes of common distributed systems: GFS and HDFS distribute by data volume; MapReduce localizes computation along GFS's data distribution; BigTable and HBase distribute by data range; PNUTS distributes by hash or by data range (selectable); Dynamo and Cassandra use consistent hashing; Mola, Armor, and BigPipe distribute by hash; Doris combines hash distribution with distribution by data volume.
Third, data replica protocols:
1. Replicas must satisfy given availability and consistency requirements and tolerate specific faults, so that reliable service is still provided even when some components fail.
2. There are two basic replica control protocols: centralized and decentralized.
3. The basic idea of a centralized replica control protocol is that a central node coordinates updates to the replicas and maintains consistency among them. Its advantage is simplicity: all replica-related control is handled by the central node, and concurrency control is performed there too, so a distributed concurrency control problem reduces to a single-machine one. Concurrency control means resolving "write-write", "read-write", and similar conflicts when multiple nodes need to modify replica data at the same time; on a single machine this is commonly done with locks. Locking is also a common method for distributed concurrency control, but without a centralized lock manager a fully distributed locking system is required, which makes the protocol very complex. The disadvantage is that the system's availability depends on the central node: when it fails, or communication with it is interrupted, some services (usually at least the update service) are lost. A centralized protocol therefore implies some outage time; it is a single point of failure, and even if the central node is itself a cluster, it remains, in effect, one large single point.
4. Common replica synchronization problems: 1) a network anomaly leaves a replica without the data; 2) dirty (stale) reads: the primary has updated the data, but for some reason a reader does not get the latest value; 3) a newly added node has not yet received the primary's data, and reads directed at the new node return stale or missing data.
5. A decentralized replica control protocol has no central node: all nodes are fully equal and reach agreement by negotiation among peers, so there is no loss of service caused by a central-node failure. Nothing is perfect, though: the biggest drawback of decentralized protocols is that their flows are usually complex, especially when strong consistency is required, where they become intricate and hard to understand. Because of that complexity, their efficiency and performance tend to be low.
6. Paxos is the only strongly consistent decentralized replica control protocol that has been applied in engineering practice; ZooKeeper and Chubby are applications of this protocol.
ZooKeeper uses the Paxos protocol to elect a leader, a lease protocol to control whether data is valid, and a quorum protocol to synchronize the leader's data to the followers.
When a quorum write in ZooKeeper does not fully succeed, the followers re-synchronize with the leader to restore consistency: if the newly elected leader does not hold the partially written data, the followers sync back to the original data, which amounts to a rollback; if the machine that received the newest write first becomes leader, everyone is updated to the newest data.
7. Megastore uses an improved Paxos protocol.
8. Dynamo and Cassandra use a decentralized protocol based on consistent hashing; Dynamo manages its replicas with a quorum mechanism.
9. The lease mechanism is one of the most important distributed protocols and is widely used in practical distributed systems. 1) A lease is usually defined as a promise issued by a grantor that gives the holder certain rights for a limited period. 2) A lease expresses the issuer's commitment during that period: as long as the lease has not expired, the issuer must strictly honor what it promised. 3) The holder may use the issuer's commitment within the term, but once the lease expires the holder must either give up the rights or renew the lease with the issuer. 4) A lease issued by a central server means the server guarantees that the value of the corresponding data will not be modified during the lease's validity period. 5) A lease can be invalidated by version number, by elapsed time, or at a fixed point in time.
The principle is the same as ordinary caching, for example browser caching. Because validity depends entirely on time, it requires the clocks to be synchronized.
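A minimal lease sketch with invented names: the holder serves its cached value only while the lease is valid, and the scheme presumes the two sides' clocks are (roughly) in sync:

```python
import time

class Lease:
    """A promise that the guarded value will not change before expiry."""

    def __init__(self, value: str, ttl_seconds: float) -> None:
        self.value = value
        self.expires_at = time.time() + ttl_seconds

    def valid(self) -> bool:
        return time.time() < self.expires_at

lease = Lease(value="config-v1", ttl_seconds=10.0)

def read_config() -> str:
    if lease.valid():
        return lease.value   # safe to serve from cache: the issuer promised
    raise RuntimeError("lease expired: give up the cached value or renew it")

print(read_config())
```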
10. Heartbeat detection is unreliable. If detector Q probes machine A, A may well receive the probe, but its response may be delayed (for example, A is temporarily blocked), so Q concludes A is down even though A recovers moments later; or the network between them may be cut; or Q itself may be faulty and wrongly judge A to be down. If Q's results alone decide liveness, many healthy hosts may be misjudged as down.
11. Write-all-read-one (WARO) is the simplest replica control rule. As the name implies, all replicas are written on update, and the update counts as successful only if it succeeds on every replica. All replicas therefore stay consistent, and a read can be served from any single replica: write all copies, read from one.
12. The quorum protocol: in essence, the number of replicas read plus the number of replicas successfully written must exceed the total number of replicas, so the set of replicas you read is guaranteed to contain the most recent successful write.
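A minimal quorum read/write sketch (W/R/N are the usual parameter names; the code itself is invented). WARO is the special case W = N, R = 1:

```python
N, W, R = 3, 2, 2
assert W + R > N   # the overlap condition: any read set meets the write set

replicas = [{"version": 0, "value": None} for _ in range(N)]

def write(value, version, acked_replicas):
    """Apply the write to the replicas that acknowledge; succeed if >= W."""
    for i in acked_replicas:
        replicas[i] = {"version": version, "value": value}
    return len(acked_replicas) >= W

def read(read_set):
    """Read R replicas and return the highest-versioned value among them."""
    assert len(read_set) >= R
    return max((replicas[i] for i in read_set), key=lambda r: r["version"])

write("v1", 1, acked_replicas=[0, 1])    # replica 2 missed the write
print(read([1, 2]))                      # still sees version 1 via replica 1
```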
13. Replica management in the Mola* and Armor* systems is entirely quorum-based: data is considered successfully updated once the update succeeds on a majority of replicas.
14. Replica management in BigPipe* uses the WARO mechanism.
Fourth, log technology:
1. Log technology is one of the main techniques for crash recovery. It was originally used in database systems. Strictly speaking, logging is not a distributed-systems technique, but in distributed-systems practice it is widely used for crash recovery, and systems such as BigTable even store their logs in a distributed system to further improve fault tolerance.
2. Two particularly useful log techniques are the redo log and the no-redo/no-undo log.
3. Database logs fall into four main classes: undo log, redo log, redo/undo log, and no-redo/no-undo log. The four differ in when the log and the data files must be written, which leads to different performance and efficiency.
4. This section describes another special logging technique, the "no-undo/no-redo log", also known as the "0/1 directory". There are two data directories plus a master record that points at the currently active directory: for example, the old data lives in directory 0 and the new data in directory 1. A read consults the master record to find the current working directory; if it points at directory 0, the data is fetched from directory 0, not from directory 1.
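A minimal 0/1-directory sketch with an invented in-memory layout; committing is the single switch of the master record, so a crash leaves either the old state or the new state intact and recovery needs neither undo nor redo:

```python
state = {
    "master_record": 0,                 # which directory is current
    "dirs": [{"balance": 100}, {}],     # directory 0 (old), directory 1 (new)
}

def read(key):
    return state["dirs"][state["master_record"]][key]

def commit_update(key, value):
    active = state["master_record"]
    inactive = 1 - active
    state["dirs"][inactive] = dict(state["dirs"][active])  # copy current data
    state["dirs"][inactive][key] = value                   # apply the update
    state["master_record"] = inactive                      # the switch commits

print(read("balance"))        # 100, served from directory 0
commit_update("balance", 80)
print(read("balance"))        # 80, now served from directory 1
```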
5. MySQL's master-slave design is also log-based: a slave stays in sync simply by replaying the master's log. Because the slave's replay speed is not strictly tied to the master's update speed, this approach can only achieve eventual consistency.
6. On a single machine, transactions are implemented with techniques such as logging or MVCC.
7. The idea of two-phase commit is simple. In the first phase, the coordinator asks all participants whether the transaction can be committed (asks them to vote), and every participant sends its vote to the coordinator. In the second phase, the coordinator decides, based on all the votes, whether the transaction can be committed globally, and informs all participants to carry out that decision. During the process, participants may not change their votes. The protocol can commit globally only on the premise that every participant votes to commit; if even one participant votes to abort, the transaction must be abandoned. It is fair to say the two-phase commit protocol has no good fault-tolerance mechanism for timeout-related failures: the whole process can only block, and a participant that does not know the outcome can neither commit nor abort its local transaction.
8. First, the two-phase commit protocol tolerates faults poorly.
9. Second, its performance is poor. A successful run requires at least two rounds of interaction between the coordinator and each participant, with four messages: "prepare", "vote-commit", "global-commit", and "confirm global-commit". Too many interactions degrade performance; moreover, the coordinator must wait for all participants to vote, so a single slow participant slows the entire run.
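A toy in-process run of the two-phase commit flow described above (names invented, no real network, and none of the timeout handling a real implementation needs):

```python
def two_phase_commit(participants):
    # Phase 1: collect votes from every participant.
    votes = [p["vote"]() for p in participants]
    decision = "commit" if all(votes) else "abort"
    # Phase 2: broadcast the global decision to everyone.
    for p in participants:
        p[decision]()
    return decision

def make_participant(name, will_commit=True):
    return {
        "vote":   lambda: will_commit,
        "commit": lambda: print(f"{name}: committed"),
        "abort":  lambda: print(f"{name}: rolled back"),
    }

print(two_phase_commit([make_participant("p1"), make_participant("p2")]))
# A single abort vote dooms the whole transaction:
print(two_phase_commit([make_participant("p1"),
                        make_participant("p2", will_commit=False)]))
```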
10. As the name implies, MVCC (multi-version concurrency control) implements concurrency control by keeping multiple versions of the data. The basic idea is that each transaction produces a new version of the data, and reads choose an appropriate version so that each transaction sees a complete, consistent result. Under MVCC, every transaction updates against an effective base version, so transactions can proceed in parallel. Across multiple nodes, the idea is to read data at the same version number on every node.
11. The workflow of MVCC is very similar to that of a version control system such as SVN; put the other way, SVN and similar systems use the MVCC idea.
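A minimal MVCC sketch with an invented structure: every write appends a new version, and a reader pins a snapshot version so concurrent writers cannot disturb what it sees:

```python
class MVCCStore:
    """Each write appends a new version; readers pin a snapshot version."""

    def __init__(self) -> None:
        self.versions = {}   # key -> list of (version, value), oldest first

    def write(self, key, value, version) -> None:
        self.versions.setdefault(key, []).append((version, value))

    def read(self, key, snapshot):
        """Return the newest value whose version is <= the snapshot."""
        visible = [(v, val) for v, val in self.versions.get(key, [])
                   if v <= snapshot]
        return max(visible, key=lambda t: t[0])[1] if visible else None

store = MVCCStore()
store.write("x", "old", version=1)
store.write("x", "new", version=2)   # a concurrent writer adds a version
print(store.read("x", snapshot=1))   # 'old': the reader's snapshot is stable
print(store.read("x", snapshot=2))   # 'new'
```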
Fifth, CAP theory:
1. The definition of CAP theory is simple; its three letters stand for three conflicting properties of a distributed system: 1) Consistency: replica consistency in the CAP sense, in particular strong consistency (see 1.3.4); 2) Availability: the system can still provide service when anomalies occur; 3) Partition tolerance: the system can tolerate network partitions and handle such anomalous conditions.
2. CAP theory states that no distributed protocol can be designed that fully possesses all three CAP properties at once, i.e. one whose 1) replicas are always strongly consistent, 2) service is always available, and 3) operation tolerates any network partition. A distributed system's protocols can only strike compromises among the three.
(Repost) Principles of distributed systems