(II) From distributed consistency to CAP theory and BASE theory

Last Update:2016-07-13 Source: Internet

Author: User

Raising the question

In computer science, distributed consistency is an important problem that has been widely explored and debated. Let us first look at three business scenarios.

1. Train station ticketing

Suppose our end user is a frequent train traveler. He usually buys a ticket at the station ticket office, takes it to the ticket gate, boards the train, and begins a pleasant trip. Everything seems to go smoothly. Now imagine that his destination is Hangzhou, that only one ticket is left on a train bound for Hangzhou, and that at the same moment another passenger at a different ticket window buys the same ticket. If the ticketing system does not guarantee consistency, both purchases succeed. At the ticket gate, one of the two passengers is then told that his ticket is invalid. Of course, the modern Chinese railway ticketing system rarely has this problem. But the example shows that the end user's requirement for the system is very simple:

"Please sell me a ticket; if there are no tickets left, tell me so at the time of sale, not later that my ticket is invalid."

This places a strict consistency requirement on the ticketing system: the system data (here, the number of tickets left for the train to Hangzhou) must be accurate at every moment, no matter which ticket window reads it.
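The requirement can be illustrated with a minimal sketch, assuming a single in-process counter protected by a lock. The names TicketPool, sell, and run_windows are hypothetical and do not come from any real ticketing system; a real system would need the same guarantee across machines, which is much harder.

```python
import threading

class TicketPool:
    """Hypothetical ticket counter shared by every ticket window."""

    def __init__(self, remaining):
        self._remaining = remaining
        self._lock = threading.Lock()

    def sell(self):
        """Return True if a ticket was sold, False if none were left."""
        # The check and the decrement happen under one lock, so two windows
        # can never both sell the last ticket.
        with self._lock:
            if self._remaining == 0:
                return False
            self._remaining -= 1
            return True

    def remaining(self):
        with self._lock:
            return self._remaining

def run_windows(pool, windows):
    """Let `windows` ticket windows each try to sell one ticket concurrently."""
    results = []
    results_lock = threading.Lock()

    def window():
        sold = pool.sell()
        with results_lock:
            results.append(sold)

    threads = [threading.Thread(target=window) for _ in range(windows)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

pool = TicketPool(remaining=1)            # the last ticket to Hangzhou
outcomes = run_windows(pool, windows=2)   # two passengers, two windows
print(sorted(outcomes))                   # one sale succeeds, one is refused
```

Whichever window runs first gets the ticket; the other is refused at the time of sale rather than at the ticket gate.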

2. Bank transfer

Suppose our end user is a recent college graduate who, like many, sends money home after receiving his first month's salary. When he completes the transfer at the bank counter, the teller kindly reminds him: "Your transfer will be credited after N business days!" Somewhat frustrated, the graduate tells the teller: "All right, it does not matter how long it takes, as long as no money goes missing!" This has become a basic requirement of almost every user of a modern banking system.

3. Online shopping

Suppose our end user is an online shopper. When he sees that a product he likes has 5 items in stock, he quickly confirms the purchase, fills in the delivery address, and places the order. At the moment of ordering, however, the system may tell him: "Insufficient inventory!" Most consumers then blame themselves for being too slow and letting someone else snatch the product.

In fact, engineers with experience developing online shopping systems know that the inventory shown on the product details page is usually not the product's real inventory; the system checks the real inventory only when the order is actually placed. But who cares?
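A minimal sketch of this behavior, assuming the details page shows a cached figure while orders are checked against the authoritative count. The Shop class and its method names are hypothetical, chosen only for illustration.

```python
class Shop:
    """Hypothetical shop with a cached display figure and a real inventory."""

    def __init__(self, stock):
        self._stock = stock        # authoritative inventory
        self._displayed = stock    # cached figure shown on the details page

    def displayed_stock(self):
        return self._displayed

    def refresh_display(self):
        # Stands in for the cache being refreshed at some later time.
        self._displayed = self._stock

    def place_order(self, quantity):
        # Only the order path consults the real inventory.
        if quantity > self._stock:
            return "Insufficient inventory!"
        self._stock -= quantity
        return "Order placed"

shop = Shop(stock=5)
other_shopper = shop.place_order(5)   # someone else buys all 5 items
seen = shop.displayed_stock()         # the details page still shows 5
result = shop.place_order(1)          # the real check happens only now
print(seen, result)
```

The displayed number is allowed to be stale because the order path, where a wrong number would cost the user money, always checks the real inventory.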

Interpreting the problem

From these three examples you can see that end users have different data consistency requirements for different computer products:

1. Some systems must respond to users quickly and also guarantee that the data seen by every client is true and reliable, like the railway station ticketing system.

2. Some systems must guarantee that user data is absolutely safe and reliable; consistency may be delayed, but strict consistency must ultimately be reached, like the bank transfer system.

3. Some systems show users data that could be called "wrong", but verify the data accurately at a certain step in the overall flow so that users suffer no unnecessary loss, like the online shopping system.

Introducing distributed consistency

One of the most important problems a distributed system must solve is data replication. Many developers have met the following problem in daily work: client C1 updates a value K in the system from V1 to V2, but client C2 cannot read the latest value of K immediately and only sees it after some time. This is normal, because replication between databases takes time.
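The situation with C1 and C2 can be reproduced with a minimal sketch, assuming one primary and one secondary replica with asynchronous replication. The classes Replica and AsyncReplicatedStore are hypothetical, and the replicate method simply stands in for the replication delay elapsing.

```python
class Replica:
    """Hypothetical in-memory replica holding key/value pairs."""

    def __init__(self):
        self.data = {}

class AsyncReplicatedStore:
    """Writes go to the primary; the secondary catches up later."""

    def __init__(self):
        self.primary = Replica()
        self.secondary = Replica()
        self._pending = []   # updates not yet applied to the secondary

    def write(self, key, value):
        # The write is acknowledged as soon as the primary has it.
        self.primary.data[key] = value
        self._pending.append((key, value))

    def read_from_secondary(self, key):
        return self.secondary.data.get(key)

    def replicate(self):
        # Apply queued updates in order, as if the replication delay had passed.
        for key, value in self._pending:
            self.secondary.data[key] = value
        self._pending.clear()

store = AsyncReplicatedStore()
store.write("K", "V1")
store.replicate()                         # both replicas now hold V1
store.write("K", "V2")                    # client C1 updates K
stale = store.read_from_secondary("K")    # client C2 reads too early
store.replicate()
fresh = store.read_from_secondary("K")    # client C2 reads again later
print(stale, fresh)
```

C2 sees the old value until replication has run, which is exactly the window of inconsistency described above.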

The need to replicate data in a distributed system typically comes from two sources:

1. To increase availability, so that a single point of failure does not make the system unavailable.

2. To improve overall performance: with load balancing, replicas located in different places can all serve users.

The benefits of data replication for availability and performance are self-evident, but the consistency challenges it brings are something every system developer has to face.

The distributed consistency problem refers to the data inconsistencies that can appear between different data nodes once a replication mechanism is introduced in a distributed environment, and that the application itself cannot resolve. In short, data consistency means that when you update one replica, you must ensure the other replicas are updated too; otherwise the replicas will disagree.

So how can this problem be solved? One idea is: "Since the problem is caused by replication delay, I can block the write until the data has been copied, and only then complete the write." This does seem to solve the problem, and some system architectures use the idea directly. But while it solves the consistency problem, it brings a new one: write performance. If your application has many write requests, each write blocks the ones that follow it, and the overall performance of the system drops sharply.
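The cost of blocking writes can be estimated with a rough model. The sketch below assumes that replicas are contacted in parallel, that writes are issued one after another, and that the latency figures are invented for illustration; none of these numbers come from a real system.

```python
def sync_write_latency_ms(local_ms, replica_rtts_ms):
    """A blocking write returns only after the slowest replica acknowledges."""
    return local_ms + max(replica_rtts_ms, default=0.0)

def async_write_latency_ms(local_ms):
    """A non-blocking write returns once the local copy is updated."""
    return local_ms

def serial_writes_per_second(latency_ms):
    """Throughput of a single writer that issues writes back to back."""
    return 1000.0 / latency_ms

local = 0.5                      # hypothetical local write cost in ms
rtts = [1.0, 2.0, 20.0]          # hypothetical round trips to three replicas

sync_latency = sync_write_latency_ms(local, rtts)
async_latency = async_write_latency_ms(local)

print(sync_latency, serial_writes_per_second(sync_latency))
print(async_latency, serial_writes_per_second(async_latency))
```

Under these assumptions a single slow replica dominates every write, which is why the blocking approach trades write performance for consistency.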

In short, no distributed consistency solution satisfies every property a distributed system wants. How to guarantee data consistency without hurting performance is a tradeoff every distributed system has to weigh. This is how consistency levels were born:

1. Strong consistency

This level best matches the user's intuition: whatever the system writes is what it reads back. The user experience is good, but implementing it often has a large impact on system performance.

2. Weak consistency

At this level, the system does not promise that a written value can be read immediately after the write succeeds, nor does it promise exactly how long it will take for the data to become consistent. It only tries, as far as possible, to reach a consistent state within a certain time scale (for example, seconds).

3. Eventual consistency

Eventual consistency is a special case of weak consistency: the system guarantees that the data reaches a consistent state within a certain amount of time. It is singled out because it is the most highly regarded of the weak consistency models, and the model the industry favors for data consistency in large distributed systems.
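Eventual (final) consistency can be sketched as replicas that disagree for a while and converge after synchronization. The example below assumes a single writer that assigns increasing version numbers and a pairwise exchange in which the highest version wins; VersionedReplica and anti_entropy are hypothetical names, and real systems use more elaborate versioning.

```python
class VersionedReplica:
    """Hypothetical replica that keeps the highest version seen per key."""

    def __init__(self, name):
        self.name = name
        self.store = {}   # key -> (version, value)

    def write(self, key, value, version):
        current = self.store.get(key)
        # Ignore updates that are not newer than what is already stored.
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]

def anti_entropy(a, b):
    """Exchange entries both ways so both replicas keep the newest version."""
    for key, (version, value) in list(a.store.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.store.items()):
        a.write(key, value, version)

r1, r2 = VersionedReplica("r1"), VersionedReplica("r2")
r1.write("K", "V1", version=1)
anti_entropy(r1, r2)                     # both replicas hold V1
r1.write("K", "V2", version=2)           # the update reaches r1 only
before = (r1.read("K"), r2.read("K"))    # replicas disagree for a while
anti_entropy(r1, r2)                     # synchronization runs
after = (r1.read("K"), r2.read("K"))     # replicas have converged
print(before, after)
```

Between the write and the next synchronization round a reader of r2 sees the old value, which is the inconsistency window that weak and eventual consistency permit.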

Various problems in a distributed environment

From the moment it emerged, distributed system architecture has been accompanied by many difficulties and challenges:

1. Communication anomalies

The move from centralized to distributed systems inevitably introduces the network, and because the network itself is unreliable, it introduces additional problems. Nodes in a distributed system communicate over the network, so every communication carries the risk that the network is unavailable: a failure of a fiber link, router, DNS server, or other hardware or system can prevent a single network communication from completing. Moreover, even when communication between nodes works normally, its latency is much higher than that of an operation on a single machine. In modern computer architectures, a single-machine memory access is usually considered to take on the order of nanoseconds (typically about 10 ns), while a normal network communication takes about 0.1 to 1 ms, up to 10^5 times the memory access latency. Such a huge difference in latency affects the sending and receiving of messages, so message loss and message delay are very common.
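The gap between the two latency figures can be checked with a little arithmetic. The sketch below uses integer nanoseconds to avoid floating-point rounding; the constant names are illustrative.

```python
# Latencies from the paragraph above, expressed in integer nanoseconds.
MEMORY_ACCESS_NS = 10          # about 10 ns per memory access
NETWORK_MIN_NS = 100_000       # 0.1 ms
NETWORK_MAX_NS = 1_000_000     # 1 ms

ratio_min = NETWORK_MIN_NS // MEMORY_ACCESS_NS
ratio_max = NETWORK_MAX_NS // MEMORY_ACCESS_NS

print(f"network is {ratio_min:,}x to {ratio_max:,}x slower than memory")
```

The ratio therefore lies between 10^4 and 10^5, with the upper bound reached at 1 ms.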

2. Network partition

When network anomalies cause the latency between some nodes of a distributed system to keep growing, eventually only some of the nodes can communicate normally while the others cannot. We call this phenomenon a network partition. When a network partition occurs, small local clusters appear in the distributed system. In extreme cases, these small clusters independently carry out functions that originally required the whole distributed system, including data processing, which poses a very large challenge to distributed consistency.

3. Three states

The two points above show that the network in a distributed environment can suffer many kinds of problems. As a result, every request and response in a distributed system has a distinctive three-state outcome: success, failure, or timeout. In a traditional stand-alone system, an application gets a definite response after calling a function: success or failure. In a distributed system the network is unreliable. In most cases a network call still receives a success or failure response, but when the network misbehaves a timeout can occur, usually in one of two situations:

(1) Because of a network problem, the request was lost in transit and never reached the receiver.

(2) The receiver received and processed the request, but the response was lost on its way back to the sender.

When such a timeout occurs, the initiator of the communication cannot tell whether the request was processed successfully.
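The two timeout cases can be simulated with a minimal sketch. It assumes message loss is injected through two flags, and it adds one idea that is not in the text above: the hypothetical server is idempotent (it records each request id at most once), which is a common way to make retrying after a timeout safe. Outcome, Server, and call are illustrative names.

```python
from enum import Enum

class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    TIMEOUT = "timeout"

class Server:
    """Hypothetical server that applies each request id at most once."""

    def __init__(self):
        self.applied = set()

    def handle(self, request_id):
        self.applied.add(request_id)   # re-applying the same id changes nothing
        return Outcome.SUCCESS

def call(server, request_id, lose_request=False, lose_response=False):
    """Simulate one remote call; a lost message surfaces as a timeout."""
    if lose_request:
        return Outcome.TIMEOUT         # case (1): the server never saw it
    result = server.handle(request_id)
    if lose_response:
        return Outcome.TIMEOUT         # case (2): processed, but reply lost
    return result

server = Server()
first = call(server, "req-1", lose_request=True)
second = call(server, "req-2", lose_response=True)

# Both calls look identical to the caller, yet only req-2 was processed.
processed = sorted(server.applied)

# With an idempotent server, the caller can simply retry after a timeout.
retry = call(server, "req-2")
print(first, second, processed, retry)
```

The caller receives the same TIMEOUT value in both cases, so it cannot distinguish "never processed" from "processed but unacknowledged" without extra information.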

4. Node failure

Node failure is another common problem in a distributed environment. It refers to a server node of the distributed system going down or hanging. Experience shows that any node may fail, and that failures happen every day.

Distributed transactions

As distributed computing has developed, transactions have come to be widely used in it. In a stand-alone database it is easy to implement a transaction processing system that satisfies the ACID properties, but in a distributed database, where the data is scattered across different machines, processing that data transactionally is a great challenge.

In a distributed transaction, the participants, the servers that support transactions, the resource servers, and the transaction manager are located on different nodes of the distributed system. A distributed transaction usually involves operations on multiple data sources or business systems.

Consider one of the most typical distributed transaction scenarios: a cross-bank transfer involves calls to two remote banking services, the withdrawal service provided by the local bank and the deposit service provided by the target bank. The two services are stateless and independent of each other, and together they make up one complete distributed transaction. If the withdrawal from the local bank succeeds but the deposit fails for some reason, the system must roll back to the state before the withdrawal; otherwise the user may find that his money has gone missing.

As this example shows, a distributed transaction can be regarded as a sequence of distributed operations, such as the withdrawal and deposit services above; each operation in the sequence is usually called a sub-transaction. A distributed transaction can therefore also be defined as a nested transaction, and it too has the ACID properties. But because the sub-transactions execute on different nodes, implementing a distributed transaction processing system that guarantees ACID is very complicated.
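The rollback in the cross-bank transfer example can be sketched with a compensating action: if the deposit fails, the withdrawal is undone by depositing the money back. This is only an illustration under simplifying assumptions (in-memory banks, no concurrency, a compensation step that cannot itself fail); it is not a full ACID implementation, and Bank, transfer, and the exception names are hypothetical.

```python
class InsufficientFunds(Exception):
    pass

class ServiceUnavailable(Exception):
    pass

class Bank:
    """Hypothetical bank exposing withdrawal and deposit services."""

    def __init__(self, balances, available=True):
        self.balances = dict(balances)
        self.available = available

    def withdraw(self, account, amount):
        if not self.available:
            raise ServiceUnavailable("withdrawal service unavailable")
        if self.balances[account] < amount:
            raise InsufficientFunds(account)
        self.balances[account] -= amount

    def deposit(self, account, amount):
        if not self.available:
            raise ServiceUnavailable("deposit service unavailable")
        self.balances[account] = self.balances.get(account, 0) + amount

def transfer(source_bank, source, target_bank, target, amount):
    """Run both sub-transactions; undo the first if the second fails."""
    source_bank.withdraw(source, amount)
    try:
        target_bank.deposit(target, amount)
    except ServiceUnavailable:
        # Compensate: put the money back so the user does not lose it.
        source_bank.deposit(source, amount)
        return False
    return True

local = Bank({"alice": 100})
remote = Bank({"bob": 0}, available=False)

failed = transfer(local, "alice", remote, "bob", 60)
after_failure = (local.balances["alice"], remote.balances["bob"])

remote.available = True
succeeded = transfer(local, "alice", remote, "bob", 60)
after_success = (local.balances["alice"], remote.balances["bob"])
print(failed, after_failure, succeeded, after_success)
```

Between the withdrawal and the compensation, other readers could observe the money as missing, which hints at why guaranteeing all ACID properties across nodes is so much harder than on one machine.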

CAP theory

CAP is a classic theory of distributed systems. It states that a distributed system cannot simultaneously satisfy all three basic requirements of consistency (C: Consistency), availability (A: Availability), and partition tolerance (P: Partition tolerance); at most two of them can be satisfied at the same time.

1. Consistency

In a distributed environment, consistency means that data remains consistent across multiple replicas. Under the consistency requirement, when an update is performed on a system whose data is in a consistent state, the data must still be in a consistent state afterwards.

Consider a system that keeps copies of data on different nodes. If the data on the first node is updated successfully but the data on the second node is not updated accordingly, a read from the second node still returns the old data (also called dirty data). This is a typical case of distributed data inconsistency. If, after a successful update to a data item, all users can read its latest value, the system is considered to be strongly consistent.

2. Availability

Availability means that the service provided by the system must always be usable: for every operation a user requests, a result is returned within a limited time. The key phrases are "limited time" and "return results".

"Limited time" means that the system must return the result of a user's request within a specified time; if that time is exceeded, the system is considered unavailable. The time limit is an operational target set when the system is designed, and it differs greatly between systems. In any case, the system must respond to user requests within a reasonable time, or users will be disappointed with it.

"Return results" is the other important aspect of availability: once the system has finished processing a user's request, it must return a normal response. A normal response clearly reflects the outcome of the request, that is, success or failure, rather than a result that confuses the user.

3. Partition tolerance

Partition tolerance requires a distributed system to have the following property: when any network partition failure occurs, the system must still be able to provide a service that satisfies consistency and availability, unless the entire network environment has failed.

A network partition arises when the nodes of a distributed system are spread across different sub-networks (different machine rooms or remote networks) and, for some reason, those sub-networks lose connectivity with one another while each sub-network remains healthy internally. The network of the whole system is thus cut into several isolated regions. Note that each node joining or leaving the distributed system can also be regarded as a special kind of network partition.

Since a distributed system cannot satisfy consistency, availability, and partition tolerance all at once, we have to give up one of them.

The following table illustrates the choices:

Choice  Description
CA      Gives up partition tolerance and strengthens consistency and availability; this is in effect the choice of a traditional stand-alone database.
AP      Gives up consistency (here meaning strong consistency) and pursues partition tolerance and availability; this is the choice of many distributed system designs, such as many NoSQL systems.
CP      Gives up availability and pursues consistency and partition tolerance; seldom chosen, because a network problem directly makes the whole system unavailable.

It is important to be clear that partition tolerance is a basic requirement of a distributed system. Because the system is distributed, its components must be deployed on different nodes (otherwise it would not be a distributed system at all), so sub-networks necessarily exist. Network problems are an anomaly that a distributed system is certain to encounter, so partition tolerance is something it must face and solve. As a result, system architects usually focus on striking a balance between C (consistency) and A (availability) according to the characteristics of the business.
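The choice between C and A once a partition has occurred can be illustrated with a toy two-node cluster. This is a deliberately simplified sketch: the Cluster class is hypothetical, a partition is a single flag, and the CP mode refuses every write during a partition, whereas real CP systems typically keep serving on the side that holds a majority.

```python
class Node:
    def __init__(self, value):
        self.value = value

class Cluster:
    """Hypothetical two-node cluster that favors either C or A."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.nodes = [Node("V1"), Node("V1")]
        self.partitioned = False

    def write(self, node_index, value):
        if self.partitioned:
            if self.mode == "CP":
                # Stay consistent by refusing the write: availability is lost.
                return False
            # Stay available by accepting the write locally: replicas diverge.
            self.nodes[node_index].value = value
            return True
        # No partition: the update reaches every replica.
        for node in self.nodes:
            node.value = value
        return True

    def read(self, node_index):
        return self.nodes[node_index].value

cp = Cluster("CP")
cp.partitioned = True
cp_accepted = cp.write(0, "V2")
cp_view = (cp.read(0), cp.read(1))

ap = Cluster("AP")
ap.partitioned = True
ap_accepted = ap.write(0, "V2")
ap_view = (ap.read(0), ap.read(1))

print(cp_accepted, cp_view)
print(ap_accepted, ap_view)
```

In the CP case the replicas still agree but the write was rejected; in the AP case the write was accepted but the two replicas now return different values.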

BASE theory

BASE is an abbreviation of three phrases: Basically Available, Soft state, and Eventually consistent. BASE theory is the result of trading off consistency against availability in CAP. It grew out of the distributed practice of large-scale Internet systems and evolved gradually from the CAP theorem. Its core idea is that even when strong consistency cannot be achieved, each application can use an approach suited to its own business characteristics to make the system eventually consistent. The three elements of BASE are as follows:

1. Basically available

Basic availability means that when an unpredictable failure occurs, the distributed system is allowed to lose part of its availability. Note that this is by no means the same as the system being unavailable. For example:

(1) Loss in response time: normally an online search engine must return query results within 0.5 seconds, but because of a failure the response time increases by 1 to 2 seconds.

(2) Loss in system function: normally, consumers shopping on an e-commerce site can complete almost every order. During the shopping peaks of some holiday promotions, however, shopping activity surges, and to protect the stability of the shopping system some consumers may be directed to a degraded page.
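The second example, sending some users to a degraded page under peak load, can be sketched as a simple load check. The function name, the load and capacity figures, and the shape of the returned dictionary are all assumptions made for illustration.

```python
def handle_order_request(current_load, capacity, place_order):
    """Serve the order normally, or return a degraded page when overloaded."""
    if current_load > capacity:
        # Some functionality is lost, but the user still gets a prompt answer.
        return {"status": "degraded", "page": "busy-please-retry"}
    return {"status": "ok", "order_id": place_order()}

normal = handle_order_request(
    current_load=80, capacity=100, place_order=lambda: "order-1"
)
peak = handle_order_request(
    current_load=250, capacity=100, place_order=lambda: "order-2"
)
print(normal)
print(peak)
```

The system stays responsive in both cases; during the peak it simply offers less, which is the difference between "basically available" and "unavailable".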

2. Soft state

Soft state means that the data in the system is allowed to be in an intermediate state, and that this intermediate state is considered not to affect the overall availability of the system. In other words, the system is allowed a delay while synchronizing data between replicas on different nodes.

3. Eventual consistency

Eventual consistency emphasizes that all replicas of the data reach a consistent state after a period of synchronization. Its essence is that the system must guarantee that the data is consistent in the end, without having to guarantee strong consistency in real time.

In general, BASE theory is aimed at large, highly available, scalable distributed systems. It is the opposite of the traditional ACID properties and completely different from ACID's strong consistency model: it sacrifices strong consistency to gain availability, and allows data to be inconsistent for a period of time as long as it eventually reaches a consistent state. In real distributed scenarios, however, different business units and components have different data consistency requirements, so in the design of a specific distributed system architecture, ACID properties and BASE theory are often used together.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
