How to beat the CAP Theorem

Source: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

Translated via: http://kb.cnblogs.com/page/124567/

The CAP theorem states that a database cannot simultaneously guarantee consistency, availability, and partition tolerance.

Since partition tolerance cannot be sacrificed in a distributed system, the trade-off has to be made between consistency and availability. How to handle that trade-off is a central theme of NoSQL databases.

But sacrificing availability causes many problems, and sacrificing consistency makes a system very complex to build and maintain. These seem to be the only two options, and neither is satisfying. The CAP theorem itself cannot be changed, so is there any other way out?

 

You cannot avoid the CAP theorem, but you can isolate its complexity so that it never takes over your whole system. The complexity the CAP theorem causes is really a symptom of deeper problems in how we build data systems. Two practices in particular stand out: the use of mutable state in databases, and the use of incremental algorithms to update that state. It is the interaction between these practices and the CAP theorem that creates the complexity.

This article will challenge your basic assumptions about how data systems should be built. By rethinking the traditional approach, I will show you how to build a data system that is more elegant, scalable, and robust than you may have thought possible.

 

What is a data system?

Let's start with a very simple definition:

Query = function(all data)

This equation summarizes the entire 50-year history of databases and data systems. Every branch of the field (RDBMSs, indexing, OLAP, OLTP, MapReduce, ETL, distributed file systems, stream processors, NoSQL) can be summarized by this one equation.

A data system answers questions about a dataset. Those questions are called "queries", and the equation says that a query is just a function of all the data you have.
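To make the equation concrete, here is a minimal sketch in Java; the record format and the sample query are illustrative assumptions, not anything prescribed by the article:

```java
import java.util.List;
import java.util.function.Function;

public class QueryAsFunction {
    // Query = function(all data): a query is a pure function of the full dataset.
    static <T> T runQuery(Function<List<String>, T> query, List<String> allData) {
        return query.apply(allData);
    }

    public static void main(String[] args) {
        // The dataset: raw records (an illustrative simplification).
        List<String> allData = List.of(
                "sally lives-in chicago",
                "bob lives-in atlanta");
        // A concrete query: how many records mention Chicago?
        long chicagoCount = runQuery(
                data -> data.stream().filter(r -> r.contains("chicago")).count(),
                allData);
        System.out.println(chicagoCount); // prints 1
    }
}
```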

There are two key concepts in this equation: data and queries. They are distinct concepts that are often conflated, so let's look at what each of them really means.

Data

"Data" has two key properties.

First, data is inherently time-based. A piece of data is a fact that was true at some point in time. For example, if Sally writes in her social network profile that she lives in Chicago, then what you have is the fact that she lived in Chicago as of the moment she filled it in. If Sally later updates her profile to Atlanta, then she lives in Atlanta as of that later time, but the fact that she now lives in Atlanta does not change the fact that she used to live in Chicago. Both pieces of data are true.

 

Second, data is immutable. Because each piece of data is tied to a point in time, its truth never changes; no one can go back and alter what was true at that moment. This means there are only two essential operations on data: reading existing data and adding new data.
So CRUD becomes CR. [Translator's note: CRUD stands for create, read, update, delete.]
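As a sketch of what a CR-only data model looks like (the Fact shape and field names are assumptions for illustration):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Each fact was true at a point in time and is never modified afterwards.
record Fact(String subject, String attribute, String value, long timestamp) {}

class FactStore {
    private final List<Fact> facts = new ArrayList<>();

    // Create: the only write operation is appending a new fact.
    void create(Fact fact) {
        facts.add(fact);
    }

    // Read: existing facts can be read but never changed in place.
    List<Fact> read() {
        return Collections.unmodifiableList(facts);
    }

    // Deliberately missing: update() and delete(). Sally "moving" to Atlanta
    // is recorded as a new fact; the Chicago fact stays true of its own time.
}
```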

Query

A query is a derivation from a dataset, the way a theorem is derived from axioms in mathematics.

Earlier I defined a query as a function on the entire dataset. Of course, not every query needs the entire dataset; many need only a subset. But this definition covers every kind of query, and to "beat" the CAP theorem we need to be able to handle all of them.
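For example, "where does Sally live now?" is just a function of all the data: filter her location facts and take the most recent one. A sketch, using the same assumed Fact shape as above:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Same assumed fact shape as in the previous sketch.
record Fact(String subject, String attribute, String value, long timestamp) {}

class CurrentLocationQuery {
    // "Where does this person live now?" as a pure function of all the data:
    // keep that person's location facts and take the most recent one.
    static Optional<String> currentCity(List<Fact> allFacts, String person) {
        return allFacts.stream()
                .filter(f -> f.subject().equals(person))
                .filter(f -> f.attribute().equals("lives-in"))
                .max(Comparator.comparingLong(Fact::timestamp))
                .map(Fact::value);
    }
}
```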

Beating the CAP theorem

The CAP theorem still applies, but the usual problems it causes can be avoided by using immutable data and computing queries from the raw data from scratch.

"Beating" the CAP theorem here means keeping the system available while avoiding the complexity that normally comes with giving up consistency.

The key is that the data is immutable. Immutable data means there are no update operations, so replicas cannot diverge into conflicting versions. It also means there is no need for versioned records, vector clocks, or read-repair.
Most of the old complexity came from the conflict between incremental updates and the CAP theorem: in an eventually consistent system, mutable values require read-repair to converge. By keeping data immutable, dropping incremental updates, and computing queries from the raw data each time, all of those complications are sidestepped. The CAP theorem has been beaten.

Keeping data immutable greatly simplifies the design. It may sound similar to the row versioning in MVCC or HBase, but it is not the same, and it cannot deliver the same level of human-fault tolerance: MVCC and HBase row versioning do not keep data permanently. Once the database compacts the versions, the old data is gone. Only a system built on immutable data guarantees that you have a path to recovery when bad data is written.
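A short sketch of why immutability buys human-fault tolerance: a buggy write only adds bad facts, it destroys nothing, so recovery amounts to recomputing views over the log with the bad records filtered out (the Fact shape and the bad-record predicate are assumptions for illustration):

```java
import java.util.List;
import java.util.function.Predicate;

// Same assumed fact shape as in the earlier sketches.
record Fact(String subject, String attribute, String value, long timestamp) {}

class Recovery {
    // In a mutable store, a buggy update destroys the old value for good.
    // In an immutable log, the old facts are all still there, so recovery is
    // just recomputing views over the log with the bad records excluded.
    static List<Fact> withoutBadWrites(List<Fact> log, Predicate<Fact> writtenByBuggyCode) {
        return log.stream()
                .filter(writtenByBuggyCode.negate())
                .toList();
    }
}
```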

 

Batch computing

The problem of "how to run any function quickly on any dataset" is too hard to solve head-on, so let's first relax the requirements: assume that query results are allowed to be a few hours out of date. With that relaxation we can build a simple, elegant, and general-purpose data system. Afterwards, we will extend the solution so that the relaxation is no longer needed.

 

We store the data as files in HDFS. A file contains a sequence of data records. To add new data, we simply add a new file containing the new records to the folder that holds the full dataset. Storing data in HDFS this way satisfies the requirement of "easily storing a large and constantly growing dataset".
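As a sketch of what appending looks like with the Hadoop FileSystem API (the path and file-naming scheme are assumptions for illustration):

```java
import java.io.PrintWriter;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendFacts {
    // Append-only ingestion: new records go into a brand-new file under the
    // master dataset folder; existing files are never rewritten.
    public static void appendRecords(List<String> records) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Unique file name so we never touch existing data (illustrative scheme).
        Path newFile = new Path("/data/master/" + System.currentTimeMillis() + ".dat");
        try (PrintWriter out = new PrintWriter(fs.create(newFile))) {
            for (String record : records) {
                out.println(record);
            }
        }
    }
}
```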

Precomputing queries on this dataset is just as straightforward. MapReduce is an expressive enough paradigm that almost any function can be implemented as a series of MapReduce jobs, and tools like Cascalog, Cascading, and Pig make writing those functions much easier.
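For instance, a batch view like "pageviews per URL" can be precomputed with a single MapReduce job. This sketch assumes input lines whose first tab-separated field is a URL:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of precomputing a batch view with MapReduce: pageviews per URL.
public class PageviewCount {
    public static class PageviewMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Emit (url, 1) for every pageview record.
            String url = line.toString().split("\t")[0];
            ctx.write(new Text(url), ONE);
        }
    }

    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text url, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            // Sum all the 1s for a URL to get its total pageviews.
            long sum = 0;
            for (LongWritable c : counts) sum += c.get();
            ctx.write(url, new LongWritable(sum));
        }
    }
}
```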

Finally, to access the precomputed query results quickly, they need to be indexed, and many databases can do this. ElephantDB and read-only Voldemort can export key/value data from Hadoop so that it can be queried quickly. These databases support batch writes and random reads, but not random writes. Random writes are what make databases complicated, so by omitting them these databases end up extraordinarily simple, just a few thousand lines of code. That simplicity makes them very robust.
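The shared design can be sketched as an interface; this is a hypothetical illustration of the style, not ElephantDB's or Voldemort's actual API:

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of a batch-writable, randomly-readable view store in the
// style of ElephantDB or read-only Voldemort (not their real APIs).
interface BatchView<K, V> {
    // Random reads are served online...
    Optional<V> get(K key);

    // ...but the only "write" is atomically swapping in a complete new view
    // exported by a batch job. There is no random-write operation, which is
    // exactly the omission that keeps these stores simple and robust.
    void swapIn(Map<K, V> freshlyComputedView);
}
```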

 

Real-Time Layer

The batch system above almost solves the full problem of running arbitrary functions on arbitrary datasets in real time. Any data older than a few hours has already been incorporated into the batch views, so all that remains is to account for the last few hours of data. And answering queries over a few hours of data is much easier than answering them over the entire dataset. That is the key insight.

To handle those last few hours of data, a real-time system runs alongside the batch system. The real-time system precomputes each query function over the data from the last few hours. To resolve a query, you query both the batch view and the real-time view and merge the results into the final answer.
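A sketch of the merge step; summing is an assumed merge rule that fits additive views such as pageview counts, and other views would need their own rule:

```java
import java.util.Optional;
import java.util.function.Function;

class MergedQuery {
    // A view here is anything that answers a key lookup. The batch view covers
    // everything older than a few hours; the realtime view covers the rest.
    static long pageviews(Function<String, Optional<Long>> batchView,
                          Function<String, Optional<Long>> realtimeView,
                          String url) {
        long olderThanAFewHours = batchView.apply(url).orElse(0L);
        long lastFewHours = realtimeView.apply(url).orElse(0L);
        return olderThanAFewHours + lastFewHours;
    }
}
```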

In the real-time layer you can use read/write databases such as Riak or Cassandra, and the real-time layer does rely on incremental algorithms to update the state in those databases.

The tool that is to real-time computation what Hadoop is to batch computation is Storm. I wrote Storm to make it possible to process massive streams of data in a robust and scalable way. Storm runs continuous computations over data streams and provides strong guarantees for that processing.
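A minimal sketch of a realtime-layer bolt in Storm; the "url" field name and the in-memory view are assumptions for illustration, and package names follow Storm 1.x:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

// Sketch of a realtime-layer bolt: it keeps an incremental pageview count
// covering only recent data; the batch layer later recomputes from the master
// dataset and supersedes everything this bolt produces.
public class PageviewBolt extends BaseBasicBolt {
    private final Map<String, Long> recentCounts = new HashMap<>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // Assumes an upstream spout emits tuples with a "url" field.
        String url = tuple.getStringByField("url");
        recentCounts.merge(url, 1L, Long::sum);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // This sketch only maintains in-memory state; it emits nothing downstream.
    }
}
```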

 

Batch layer + real-time layer, the CAP theorem, and human-fault tolerance

It may seem we are back where we started: to serve real-time data, we are again using NoSQL databases and incremental algorithms, which suggests a return to versioned records, vector clocks, and read-repair, with all their complexity. But there is a fundamental difference. Since the real-time layer only covers the last few hours of data, everything it computes is eventually recomputed by the batch layer. So if a mistake is made, or anything goes wrong in the real-time layer, the batch layer will ultimately correct it. All of that complexity is transient.

 

Summary

What makes scalable data systems complex is not the CAP theorem but incremental algorithms and mutable state.
The recent rise of distributed databases has pushed that complexity to unmanageable levels. As described above, I challenged the assumptions behind the traditional way of building data systems: I turned CRUD into CR, split persistence into a batch layer and a real-time layer, and gained tolerance of human error along the way. It took years of hard-won experience to break my old assumptions about databases and reach these conclusions.

The batch/real-time architecture has other interesting capabilities that I have not yet mentioned. A few of them are summarized below.

  • Algorithmic flexibility. Some algorithms become hard to compute as data grows; computing unique counts, for example, gets harder and harder as the set of distinct identifiers grows. The batch/real-time split gives you the flexibility to use an exact algorithm on the batch layer and an approximate algorithm on the real-time layer. The batch results eventually override the real-time results, so the approximation gets corrected and your system exhibits "eventual accuracy".
  • Easy data-structure migrations. The difficulties of schema migrations are gone for good. Since batch computation is the core of the system and it is easy to run a function over the entire dataset, changing the structure of your data or views is easy.
  • Easy ad hoc analysis. Because the batch layer can compute arbitrary functions, you can run any query against your data. And since all the data is available in one place, ad hoc analysis becomes simple and convenient.
  • Self-auditing. Because data is immutable, the dataset audits itself: it records its own history, which is valuable both for tolerating human error and for analysis.

 

 

[Translator's commentary] Faced with big data, this article proposes a different approach.

Traditional approaches need complicated logic to preserve consistency while maintaining availability, for example Dynamo-style solutions that use vector clocks to merge a record's version history.

The author argues that the root of the complexity is not the CAP theorem itself, but incremental algorithms and mutable data state.

So his solution is:

First, make data immutable (CR instead of CRUD), so data-versioning problems disappear.

Second, analyze the full dataset rather than increments, which matches the author's definition of a data system: queries are functions over all the data.

For full-dataset queries we can use Hadoop. The problem is that this is slow, with delays of several hours, but it straightforwardly guarantees eventual consistency: results are at most a few hours stale, are never lost, and never change, because the data is immutable.

The remaining problem is that in some domains a delay of several hours cannot be tolerated, so real-time analysis is added as a complement. Since it does not need to guarantee strong consistency, approximate algorithms can be used for efficiency. In the end, the full-dataset queries produce the correct answers, ensuring eventual consistency.
