Relational and Non-Relational Models from the Perspective of Demand: The Transformation of the Times (1)

As we discussed last time, there is a mismatch between the actual needs of Internet applications and what traditional databases offer.

 

Therefore, tearing down and rebuilding has become the dominant theme of the new era.

 

For Internet applications, the most urgent requirement is to take in massive amounts of data from a large number of users, apply some logic to it, and return the results to those users. For online data processing, therefore, the biggest and most urgent demand of Internet companies is capacity that scales horizontally, so that write TPS and read QPS can keep growing without a hard limit.

 

By comparison, in order to maximize performance and capacity, other indicators are pushed down the list. This is easy to understand: if performance or capacity is insufficient, the direct consequence is that the service cannot be provided at all. An Internet application that cannot deliver its service to users has its core value called into question. The key requirements of Internet applications are therefore high availability, high scalability, and high performance; everything else is an extra. The priority list for online processing on the Internet can thus basically be written as: data scalability > request response time as short as possible > downtime as short as possible > cost as low as possible > automatic and fast recovery > data read consistency > convenience for developers in responding quickly to new requirements.

 

However, history evolves gradually, and the experience of predecessors must be taken into account. Because the relational model was so popular, relational databases were naturally the safest and most reasonable thing to extend at the time. This is why horizontal partitioning (sharding) of traditional databases was the first solution to this problem.

 

In this type of solution, users are generally required to supply a simple splitting condition that distributes data evenly across several, or even hundreds of, databases. No matter how much data there is, capacity can then be increased by adding machines horizontally, so the database can meet the requirements of "data scalability", "short request response time", and "short downtime". On these indicators the approach does reasonably well. The price is that the SQL statements and transactions that can run after splitting have to be restricted. However, compared with development cost, meeting the core needs comes first, so this solution basically meets the requirements.
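The splitting condition described above can be sketched as a routing function. This is a minimal illustration, assuming a hash-modulo rule and in-memory dictionaries standing in for databases; the names `shard_for`, `put`, `get`, and the shard count are hypothetical and not taken from the original article.

```python
import zlib

NUM_SHARDS = 4  # assumed shard count, chosen only for illustration

# Each dict stands in for one physical database.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard index using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def put(key: str, row: dict) -> None:
    """Write the row to the single shard that owns the key."""
    shards[shard_for(key)][key] = row


def get(key: str):
    """Read from the single shard that owns the key."""
    return shards[shard_for(key)].get(key)


put("user:42", {"name": "alice"})
print(get("user:42"))
```

Every read and write that carries the splitting key touches exactly one machine, which is what lets capacity grow by adding machines. With a plain modulo rule like this one, changing the shard count moves most keys to a different shard, so real deployments usually plan the shard count or the mapping scheme in advance.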

 

From the 1980s to this century, most applications used horizontal partitioning to meet their actual business needs. Naturally, people then began to ask for more and hoped to solve the problem more simply. Some came up with the following idea: we are comfortable with relational databases, the relational model is very clear in what it expresses, and everyone is familiar with it. Why not add a middle layer, so that the data access layer is completely transparent to the layers above and fully compatible with relational databases? Then we would not need to modify our business logic, yet could enjoy almost unlimited horizontal scalability, right?

 

The idea is good, so many people started to work in this direction. They wrote it into the requirements list at the very start of the design: database access is completely transparent to applications. Users do not need to care how data is laid out in the lower layer; they only need to write business logic according to the relational model, and everything else should be taken care of automatically.

 

However, when we set out to meet this demand, we found several problems that had to be solved:
1. How can transactions span multiple machines?
2. How can a multi-machine join be made efficient?
3. How can distributed indexes be implemented?
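The second question can be made concrete with a small sketch. It assumes two tables sharded by different keys and counts one network round trip per remote request; the table contents, the `shard_of` rule, and the round-trip counter are all hypothetical, and the naive per-row lookup is only one possible join strategy.

```python
NUM_SHARDS = 2


def shard_of(key: int) -> int:
    """Hypothetical routing rule: shard index is key modulo shard count."""
    return key % NUM_SHARDS


# users are sharded by user_id: shard 0 holds even ids, shard 1 holds odd ids.
users = [{2: "bob"}, {1: "alice", 3: "carol"}]

# orders are sharded by order_id; each value is (user_id, item).
orders = [{10: (1, "book"), 12: (2, "pen")}, {11: (3, "lamp")}]


def join_orders_with_users():
    """Naive cross-shard join: scan every order shard, then fetch each user remotely."""
    round_trips = 0
    rows = []
    for order_shard in orders:
        round_trips += 1  # one scan request per order shard
        for order_id, (user_id, item) in order_shard.items():
            round_trips += 1  # one remote lookup per order row
            name = users[shard_of(user_id)][user_id]
            rows.append((order_id, name, item))
    return sorted(rows), round_trips


rows, trips = join_orders_with_users()
print(rows, trips)
```

On a single machine each of these lookups is a memory access; across machines each one is a network request, so the cost of the join grows with the number of rows unless the lookups are batched or the two tables are sharded by the same key.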

 

These problems have plagued practitioners ever since and remain unsolved to this day. Why? Because they run into a hardware ceiling.

 

Let's make a simple analysis:
A computer is a layered system. From CPU design to distributed storage, every layer is actually dealing with a similar question. In CPU architecture design, the core problem is this: when a piece of data must be accessed by multiple processors, should the data be shared, or should each accessor have its own copy?
This gives three standard approaches for CPUs. Solution A: share everything (SMP). Solution B: share nothing (MPP). Solution C: share some parts and not others (NUMA).
The same is true in distributed storage. To scale horizontally, the amount of contended data must be minimized. However, if all machines need to read and write the same data, only sharing can achieve the best performance. Database sharding clearly belongs to the MPP architecture. Therefore, when data needs to be read and written by multiple machines, the reads and writes can only be coordinated through communication between machines.

 

Within a single machine, this coordination uses message queues or the data bus between CPUs. In a distributed system, coordination must rely on the network. The network is the hardware ceiling.

 

Since we are talking about a ceiling, we need a brief analysis of the network to better understand where that ceiling lies. As a communication medium, the network can be abstracted into the following key indicators: latency, throughput, and security.
For convenience, we compare it with the data channel from CPU to memory:
Latency: network 30 ms; memory 21 ns
Throughput: network 100 Mb/s; memory 800 MB/s
Security: network may be unreachable; memory is never unreachable

 

Clearly, latency increases enormously while throughput drops significantly. Because the network is not secure (reliable) enough, extra network interactions are needed to achieve a higher level of security. Of these attributes, latency is the most troublesome: it depends on distance, so it cannot be held constant as distance changes. We will discuss the impact of the network on distributed systems further in later articles.
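The size of the gap can be worked out from the two latency figures quoted in the comparison (30 ms for the network, 21 ns for memory). The sketch below is plain arithmetic; the assumption that an operation needs three dependent, sequential accesses is hypothetical and chosen only to show how the cost adds up.

```python
NETWORK_LATENCY_S = 30e-3  # 30 ms, figure quoted in the comparison
MEMORY_LATENCY_S = 21e-9   # 21 ns, figure quoted in the comparison

# How many times slower one network access is than one memory access.
latency_ratio = NETWORK_LATENCY_S / MEMORY_LATENCY_S


def total_latency(sequential_steps: int, per_step_s: float) -> float:
    """Latency of an operation whose steps must run one after another."""
    return sequential_steps * per_step_s


STEPS = 3  # assumed number of dependent accesses in one operation

print(f"ratio: about {latency_ratio:,.0f}x")
print(f"over the network: {total_latency(STEPS, NETWORK_LATENCY_S) * 1e3:.0f} ms")
print(f"in memory: {total_latency(STEPS, MEMORY_LATENCY_S) * 1e9:.0f} ns")
```

Because dependent steps cannot overlap, each additional round of coordination adds a full network latency, which no amount of added machines can remove.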

 

Therefore, however good the algorithm is, we cannot reduce the latency of accessing the same piece of data in a distributed setting. As a direct result, a relational algebra operation or a transaction may see its latency increase a hundredfold, which is unacceptable to the business!
The relational algebra model has a similar problem: querying shared data requires interacting with multiple machines, and that latency is far higher than the latency of memory access, so query latency is also forced up considerably.

 

As a result, in distributed storage we cannot get the same transactional or consistency experience that a standalone database provides.

 

Because this problem cannot be solved well, the approaches diverge at this point:

 

We will not judge the merits of each line of thinking here; the ultimate criterion for any idea is the actual problems it can solve.

 

Some people start from their actual scenarios and find that in some of them (such as storing business logs) they only ever use simple interfaces, while the difficulty of implementing the relational algebra model in a distributed environment is deeply frustrating. A natural inference follows: why not simply discard the relational algebra model?
So a concept came into being: NoSQL, meaning "do not use SQL". In a broader sense, it means not using the relational algebra model or its consistency model. The core idea of this school is radical: since the relational algebra model performs so poorly, why not simply discard it and choose a simpler hierarchical model to solve the problem? Because a key-value database based on a hierarchical model is relatively easy to implement, a large number of NoSQL products emerged in a short time. Their basic characteristics are that they implement only simple key-value interfaces, focus mainly on automated data operations and maintenance (O&M), and use new storage engines optimized for writes, which can reduce costs. Typical NoSQL systems include Cassandra, HBase, and MongoDB.
I think this idea is quite good. Unfortunately, some NoSQL developers, for their own benefit, have promoted NoSQL as a cure-all while ignoring the negative effect of increased development cost, which leaves one speechless. In most cases, development efficiency matters more than automated data O&M. In such cases, blindly using the raw hierarchical-model interface to implement features may greatly reduce productivity, so please be careful.

 

Others believe that the relational algebra model itself has nothing to do with scalability, and hope to support a distributed relational algebra model in a new way. This attempt is now called NewSQL. Its main idea is to make appropriate trade-offs and try to rebuild the relational algebra model in a distributed environment.

 

One of the more distinctive ideas in this field is the in-memory database. Its proponents assume that future databases should be built primarily on larger memory and on the network. In this way the database avoids the impact of slow disk devices on the system and brings latency down to an acceptable level. VoltDB and MySQL Cluster are representative.

 

Other attempts assume that the probability of a transaction conflict is small while the number of transactions is large. Based on this assumption, they rework the mechanism that traditionally relies on read/write locks to guarantee read/write consistency, so as to support database queries. The best-known implementations of this kind are Google's External Store and Spanner.
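The article does not spell out which mechanism these systems use. One common technique that fits the assumption of rare conflicts is optimistic version checking: read without locking, then write only if nobody else has written in the meantime. The sketch below is an illustration of that general technique, not a description of any Google system; the class and function names are hypothetical, and it is single-threaded for simplicity.

```python
class VersionConflict(Exception):
    """Raised when the stored version differs from the version that was read."""


class VersionedStore:
    """Toy store where every key carries a version number (not thread-safe)."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); a missing key reads as version 0, value None."""
        return self._data.get(key, (0, None))

    def write_if_unchanged(self, key, expected_version, value):
        """Commit only if no other writer has bumped the version since the read."""
        current_version, _ = self._data.get(key, (0, None))
        if current_version != expected_version:
            raise VersionConflict(key)
        self._data[key] = (current_version + 1, value)
        return current_version + 1


def increment(store, key, retries=3):
    """Read, compute, and commit; retry from the read if a conflict is detected."""
    for _ in range(retries):
        version, value = store.read(key)
        try:
            return store.write_if_unchanged(key, version, (value or 0) + 1)
        except VersionConflict:
            continue
    raise VersionConflict(key)


store = VersionedStore()
increment(store, "counter")
print(store.read("counter"))
```

When conflicts are rare, almost every commit succeeds on the first attempt and no reader ever waits on a lock; when conflicts are frequent, the retries make this approach more expensive than locking.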

 

The traditional database camp, for its part, is gradually being guided by the needs of the Internet, shedding its historical baggage and facing the needs of Internet applications with a lighter posture.

 

The implementation of Optimization for different reading scenarios is more like haoyue ..

 

Having quickly surveyed these attempts to solve the current problem, let's briefly summarize the current state of demand for massive online databases:

 

At present, the original NoSQL movement has hit a bottleneck. In real application scenarios, people can indeed build their own data models on top of a well-designed key-value engine. However, as new business requirements keep arriving, users gradually find that most of their development work amounts to this: reorganize the original data by a different key, then expose it to users for querying. This is exactly the problem that relational databases are mainly designed to solve.
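The "reorganize by another key" work described above can be sketched as a hand-maintained secondary index on top of a key-value store. The sample rows and the names `build_index` and `put_order` are hypothetical, and plain dictionaries stand in for the key-value engine.

```python
from collections import defaultdict

# Primary copy, keyed by order id (hypothetical sample rows).
orders_by_id = {
    "o1": {"user": "alice", "item": "book"},
    "o2": {"user": "bob", "item": "pen"},
    "o3": {"user": "alice", "item": "lamp"},
}


def build_index(primary: dict, field: str) -> dict:
    """Reorganize the primary data by another field: field value -> list of keys."""
    index = defaultdict(list)
    for key, row in primary.items():
        index[row[field]].append(key)
    return dict(index)


# Second copy of the data, keyed by user, so "orders of a user" is one lookup.
orders_by_user = build_index(orders_by_id, "user")


def put_order(order_id: str, row: dict) -> None:
    """Insert a new order; the application must keep both copies in sync itself.

    Simplification: updates that change an existing order's user are not handled.
    """
    orders_by_id[order_id] = row
    orders_by_user.setdefault(row["user"], []).append(order_id)


put_order("o4", {"user": "bob", "item": "ink"})
print(orders_by_user)
```

In a relational database the same need is usually met by declaring an index once or by querying on a different column, and the database keeps the copies consistent; with a bare key-value interface this bookkeeping becomes application code that has to be written again for each new query pattern.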


Once the basic requirements are met, users' demand for convenience gradually becomes the main concern. Learning from the strengths of traditional relational databases, and improving and tailoring them to the actual needs of Internet applications, is therefore the direction in which practitioners across the storage field are jointly working. At present, however, the main needs of users are not yet fully met, so different choices are made for different scenarios. The era in which one storage model solved every problem has passed, and the storage field has entered a Warring States era. With so many storage products to choose from, a question naturally arises: which kind of storage suits my scenario?

In subsequent articles, starting from the actual needs that relational databases face in distributed scenarios, I will try to describe the new challenges in the distributed storage field and the new attempts to meet them, so that everyone can quickly grasp the core thread of distributed storage and quickly select or implement a distributed storage product of their own.

From: http://jm.taobao.org/2013/06/19/%E4%BB%8E%E9%9C%80%E6%B1%82%E5%87%BA%E5%8F%91%E6%9D%A5%E7%9C%8B%E5%85%B3%E7%B3%BB%E6%A8%A1%E5%9E%8B%E4%B8%8E%E9%9D%9E%E5%85%B3%E7%B3%BB%E6%A8%A1%E5%9E%8B-%E6%97%B6%E4%BB%A3%E7%9A%84%E5%8F%98%E9%9D%A9/
