Thoughts on the Evolution of Large-Scale Website Technology (III): Storage Bottleneck (3) (Repost)

Last Update:2015-01-28 Source: Internet

Author: User

Original: http://www.cnblogs.com/sharpxiajun/p/4251714.html

The storage bottleneck discussion is now entering deep water. Once a site has reached the stage of splitting its database vertically and horizontally, the technical challenges it faces grow considerably.

Let us first review the definitions of vertical and horizontal splitting of a database:

Vertical splitting: data belonging to different business units in one database is separated into different databases.

Horizontal splitting: data belonging to the same business unit is split across multiple databases according to some rule.
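
To make the two definitions concrete, here is a minimal Python sketch of how a program might choose a database after each kind of split. The table names, the database names, and the modulo rule are all assumptions made up for illustration; the definitions above do not prescribe any particular rule.

```python
# Hypothetical names throughout: none of these tables or databases
# come from the article.

# Vertical split: each table (business unit) lives in exactly one database.
VERTICAL_MAP = {
    "users": "user_db",
    "orders": "order_db",
    "products": "product_db",
}

# Horizontal split: one table is spread over several databases by a rule.
ORDER_SHARDS = ["order_db_0", "order_db_1", "order_db_2", "order_db_3"]


def database_for_table(table: str) -> str:
    """Vertical routing: the table name alone decides the database."""
    return VERTICAL_MAP[table]


def database_for_order(user_id: int) -> str:
    """Horizontal routing: the row's key decides the database.

    The modulo rule is only one possible rule, assumed here for simplicity.
    """
    return ORDER_SHARDS[user_id % len(ORDER_SHARDS)]
```

With vertical routing, `database_for_table("users")` depends only on the table. With horizontal routing, two rows of the same table can land in different databases, which is where the extra difficulty comes from.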

Vertical splitting is coarse-grained: it moves whole tables that used to share one database into different databases. Horizontal splitting is finer-grained: it splits a single table across different databases. The difference in granularity brings a difference in implementation difficulty, and horizontal splitting is clearly much harder than vertical splitting. Greater difficulty means higher cost and more risk. When building systems we need to be clear about one thing: if a simple solution solves the problem, drop the complex one without hesitation, and when a system does need a difficult technology, we must first be convinced that it is a necessity.

I joined my current company as a Java engineer, so before I moved to full-time front-end work I did a lot of Java application development. When I started, senior colleagues told me that the company's database modeling was deliberately simple. How simple? Tables have no foreign keys. Triggers are not allowed. Stored procedures may be written, but never to handle production business logic, only for auxiliary work such as importing and exporting data. I was even told that when a database is set up with read/write separation, data synchronization should preferably be done by Java programs rather than stored procedures, unless there is no other choice.

At first I did not understand these rules, and I questioned them: if we use so little of the database, we might as well ask the vendor to build us a stripped-down version. After I learned about Hadoop, I began to see the reasoning. A database, like the programs we write, does two things: storage and computation. When a database is the storage medium of a business system, its storage role matters far more to that system than whatever computation it can take on. If the system is an Internet system that grows quickly, the demands on storage keep rising, and in the end we will probably want to strip out the database's computing features altogether. The basic insert, delete, update, and query operations cannot be given up, of course, because they are the gateway between the database and the outside world. Anyone who has worked with a database holding a huge amount of data will have noticed that the individual SQL statements run against it are unusually short and simple. By that point the database is already carrying too heavy a storage burden, and the only way we can help is to minimize the load of its computation.

Back to the question of vertical and horizontal splitting. If a database is designed the way my company designs its business databases, what problems appear when it is split? To answer this, I will compare how programs call the database before and after the split. There are two main differences:

First: queries that join a split-off table with other tables in the original database have to be changed.

Second: some insert, update, and delete operations involve both the split-off table and other tables that remain in the original database. (Note: business databases rarely use physical deletion because it is dangerous. Deletion is usually logical, typically an update of the record's status, so it is essentially an update operation.) After the split, the transaction around such an operation is broken. If this is handled poorly and the operation fails, the business cannot be rolled back, which is a serious risk to the safety of business operations.

The first problem is relatively easy to solve, and there are many ways to do it. Here are the ones I know:

　　Method One: When splitting tables vertically, first go through the SQL queries that use JOIN, taking the table being split off as the starting point. If a joined table is only weakly dependent on it, rewrite the SQL query. If a joined table is strongly dependent on it, move that table out together with the split-off table. This approach is simple and very controllable, but it has a problem: it makes the split coarser and disturbs the business rules that were meant to drive the split. It also tends to leave behind a certain kind of table, one that many databases are associated with and whose relationships are hard to untangle. When we cannot untangle them, we fall back on redundancy and keep similar copies of the table in different databases. As the business grows, synchronizing the data of this table becomes the database's weak spot, and eventually the weakest link of the whole database system or even of the whole system.

　　Method Two: Split the tables at the database level according to the chosen criteria or the needs of the business, and only after the split is done rewrite the join queries that were affected. The cost of modifying query statements is low, because a query is a read-only operation and changes nothing underneath. If the tables now sit in different databases, we can split the join query into several queries and merge the results in memory. In practice, when we split a database ourselves we would not switch to a different database product for the new database. We would use the same product, and databases of the same type generally support cross-database queries. I have heard that cross-database queries are not very efficient, though, so use them selectively. This approach also has a fatal drawback: a vertical split is never finished in one step but usually takes several iterations, and the impact of this approach is very wide. Each time a table is moved, almost all of the related SQL statements have to be changed, which lets unpredictable risk accumulate in the system.
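
Splitting a join into several queries and merging the results in memory can be sketched as follows. This is only an illustration under assumptions of my own: two in-memory SQLite databases stand in for the databases produced by the split, and the `users` and `orders` tables and their columns are hypothetical.

```python
import sqlite3

# Two separate in-memory SQLite databases stand in for the databases
# produced by a vertical split (hypothetical schema, for illustration only).
user_db = sqlite3.connect(":memory:")
user_db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
user_db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

order_db = sqlite3.connect(":memory:")
order_db.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
)
order_db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(100, 1, 9.5), (101, 2, 20.0), (102, 1, 3.0)],
)


def orders_with_user_names():
    """Replace `orders JOIN users` with two simple queries merged in memory."""
    # First query: read the orders from the order database.
    orders = order_db.execute(
        "SELECT id, user_id, amount FROM orders ORDER BY id"
    ).fetchall()
    if not orders:
        return []

    # Second query: fetch only the users referenced by those orders.
    user_ids = sorted({user_id for _, user_id, _ in orders})
    placeholders = ",".join("?" for _ in user_ids)
    rows = user_db.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids
    ).fetchall()
    names = dict(rows)

    # In-memory "join": attach the user name to each order row.
    return [
        (order_id, names.get(user_id), amount)
        for order_id, user_id, amount in orders
    ]
```

Each statement sent to a database stays short and simple, and the merge happens in the program. The price is that the merge logic now lives in application code and has to be maintained there.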

The following three paragraphs describe Method Three:

Method One and Method Two share a fundamental flaw: the database and the higher-level business operations are tightly coupled. Every database change forces a large amount of follow-up work in business development, which wastes resources. The people building services cannot be led around by the database every day. If they are, both daily maintenance and business expansion suffer. We therefore need a scheme that decouples the service from the database, and here we can borrow from ORM technology. (A note: in Methods One and Two I talked about modifying SQL. In real development many systems use an ORM. Internet companies generally use semi-ORM products such as iBATIS and MyBatis, because they let you write SQL directly and stay closest to the database. Hibernate is different: most Hibernate code does not write SQL directly, but it is still only a mapping layer over database operations, so the essential approach is the same. The SQL discussed above should therefore be read as shorthand that also covers ORM mappings.)

Traditional ORM technologies such as Hibernate and MyBatis target a single database and do not solve the problems of a vertical split, so we have to develop an ORM layer of our own to handle cross-database operations. Here I will only discuss this ORM from the query side, and only from my own point of view. (Does this feel familiar to some readers? Does it not look like a distributed system?)

How to restructure the SQL is not what I want to discuss, because that is a matter of technique. My focus is on the interface between this ORM and the service layer. What the service layer fears most is being led around by the database: when the database undergoes a major change, the service layer tries to stay unchanged. The database layer should accommodate the service layer and treat it as its customer, so that the two sides can complete this important task together. So how does the service layer usually interact with the database layer?

Traditional ORM technology gives us the answer. There are two approaches:

The first: the approach represented by Hibernate. The Hibernate framework has its own query language, HQL, which resembles SQL. A custom query language looks attractive and is flexible, but it is very hard to implement, because it amounts to designing a new language. If the language is not well designed, or its users do not understand it deeply, it often backfires. With Hibernate, for instance, we often prefer to write SQL directly rather than use HQL, and anyone who has used it will understand why.

The second: the data layer provides the service layer with methods to call, and each method corresponds to a specific database operation. Even if the underlying database changes significantly, as long as the method definitions offered to the service layer stay the same, the change has minimal impact on the service layer.
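
The second approach can be sketched in a few lines of Python. Everything named here (`CustomerDao`, `name_and_total`, the two implementations, and the `users` and `orders` schema) is hypothetical and only meant to show the shape of the idea: the service layer calls one fixed method, and the data layer is free to change what happens behind it. The sketch assumes the requested user exists.

```python
import sqlite3
from abc import ABC, abstractmethod


class CustomerDao(ABC):
    """The only thing the service layer sees (hypothetical interface)."""

    @abstractmethod
    def name_and_total(self, user_id: int) -> tuple:
        """Return (user name, total order amount) for one existing user."""


class SingleDbDao(CustomerDao):
    """Before the split: both tables sit in one database, so a JOIN works."""

    def __init__(self, db: sqlite3.Connection):
        self.db = db

    def name_and_total(self, user_id):
        return self.db.execute(
            "SELECT u.name, COALESCE(SUM(o.amount), 0) FROM users u "
            "LEFT JOIN orders o ON o.user_id = u.id "
            "WHERE u.id = ? GROUP BY u.id, u.name",
            (user_id,),
        ).fetchone()


class SplitDbDao(CustomerDao):
    """After the split: same method, two databases, no JOIN."""

    def __init__(self, user_db: sqlite3.Connection, order_db: sqlite3.Connection):
        self.user_db = user_db
        self.order_db = order_db

    def name_and_total(self, user_id):
        (name,) = self.user_db.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        (total,) = self.order_db.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE user_id = ?",
            (user_id,),
        ).fetchone()
        return (name, total)


def customer_report(dao: CustomerDao, user_id: int) -> str:
    """Service-layer code: it does not change when the database is split."""
    name, total = dao.name_and_total(user_id)
    return f"{name} has spent {total:.2f}"
```

`customer_report` is written once against `CustomerDao`. Swapping `SingleDbDao` for `SplitDbDao` is a change inside the data layer only, which is the low-impact property described above.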

I said earlier that technical difficulty is an important criterion when choosing a technology, so the second approach is our first choice.

A vertical split brings another problem: its effect on transactions. It turns what used to be local transactions into distributed transactions, and distributed transactions are very hard to solve, especially if we want to adopt one of the schemes the industry has published, since implementing a distributed transaction scheme is difficult in itself. I should explain what I mean by difficult. I delayed writing this article because I wanted to study those published schemes first, but I found their principles very frustrating. If we simply implement them through a programming interface without understanding the principles behind them, they are solutions we do not control, and using too many of them may plant a time bomb in the system. So I will only list them here, and interested readers can study them:

First, XA, the distributed transaction specification published by the X/Open organization, together with the Distributed Transaction Processing model that X/Open defines;

Second, CAP and BASE, the consistency theories for large websites;

Third, the Paxos protocol.

The Paxos protocol deserves a special mention. I have written several articles about ZooKeeper. One feature of ZooKeeper is that it behaves like a distributed file system: when we write data to ZooKeeper, the cluster guarantees the reliability of the write, much as we use thread safety to control writes so that a write never goes wrong. ZooKeeper can do this because it contains a Paxos-like protocol, which works like an election scheme and guarantees the atomicity of write operations.

A transaction is similar to thread-safety techniques, except that what it guarantees is the atomicity of a business operation. A transaction must also have a rollback mechanism: if the business operation fails, the transaction restores the system to the state it was in before the operation. The essence of rollback is maintaining the state of the business operation. For example, before the system performs a business operation we define an initial state. While the operation runs it is in an executing state. If it succeeds it enters a success state, and if it fails it enters a failure state. When an operation is in the failure state we roll the business back to the initial state. Going further, if the executing state times out, we can also fall back to the initial state. All transaction rollback mechanisms are essentially the same.

Not long ago someone in a chat group asked how to implement distributed transactions. He wanted to know whether distributed transactions can be handled with something as simple as the commit and rollback calls we use through JDBC against a single database. In reality, distributed transactions are more complex than commit and rollback, and we cannot get them by writing a few tags. The industry does have schemes, which I listed above, and anyone who really wants to know can study them, but I still do not understand their principles and ideas well.
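
The state-based view of rollback described above can be written down as a small state machine. This is a sketch under my own assumptions: the class and method names are invented, and it only tracks the state of an operation. A real system would also have to undo the data changes that belong to the operation.

```python
import time
from enum import Enum


class State(Enum):
    INITIAL = "initial"
    EXECUTING = "executing"
    SUCCESS = "success"
    FAILED = "failed"


class BusinessOperation:
    """Tracks one business operation through the four states (hypothetical)."""

    def __init__(self, timeout_seconds: float):
        self.state = State.INITIAL
        self.timeout_seconds = timeout_seconds
        self.started_at = None

    def start(self, now=None):
        """Enter the executing state and remember when execution began."""
        self.state = State.EXECUTING
        self.started_at = time.monotonic() if now is None else now

    def finish(self, ok: bool):
        """Record the outcome of the operation."""
        self.state = State.SUCCESS if ok else State.FAILED

    def roll_back_if_needed(self, now=None) -> bool:
        """Return to INITIAL after a failure or an execution timeout."""
        now = time.monotonic() if now is None else now
        timed_out = (
            self.state is State.EXECUTING
            and now - self.started_at > self.timeout_seconds
        )
        if self.state is State.FAILED or timed_out:
            self.state = State.INITIAL
            self.started_at = None
            return True
        return False
```

The `now` parameter exists only so that the timeout can be exercised without waiting. In normal use it is left out and the monotonic clock is used.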

I did give that friend a solution right away. I said that we run into distributed transactions often in development, but we solve them from the business side rather than by purely technical means, because the technology is too complex to control. The answer may not have satisfied him, but I still hold that view. It follows the principle I mentioned earlier: when a technical solution is too difficult, we should not choose it, because it is dangerous. Let me give an example that may be more convincing.

Many of the business operations I work on are completed jointly with other systems, some of them our own and some belonging to other companies. Think of a business operation as a car on a highway and each system as a toll station. Each time the car reaches a toll station, that system records a status in a table in its own database. When the car has completed the whole route, the toll stations notify each other that the task is finished, and every status ends up as completed. If the operation fails, as if the car had been scrapped on the way, the toll stations also notify each other and set all the recorded statuses back to the original state, as if the car had never come. This approach relies on the essence of transaction rollback: changing state and falling back. In business system development it has its own name: workflow. Most questions about how to implement distributed transactions come down to how to roll the transaction back. We should not be frightened by the name, because many unremarkable technical and business measures can achieve the same goal.
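
The toll-station example can be sketched as follows. The names `TollStation` and `run_workflow` are my own inventions, a dictionary stands in for the status table in each system's database, and everything runs in one process. In a real deployment the notifications travel between systems, and a revert can itself fail, which this sketch does not handle.

```python
class TollStation:
    """One participating system; it keeps a status per business operation."""

    def __init__(self, name, should_fail=False):
        self.name = name
        self.should_fail = should_fail  # hook to simulate a failing system
        self.status = {}  # operation id -> "done" (stand-in for a table)

    def process(self, op_id):
        """The car arrives: do the work and record the status."""
        if self.should_fail:
            raise RuntimeError(f"{self.name} could not process {op_id}")
        self.status[op_id] = "done"

    def revert(self, op_id):
        """Back to the original state, as if the car never came."""
        self.status.pop(op_id, None)


def run_workflow(op_id, stations):
    """Drive the operation through every station; undo all of them on failure."""
    passed = []
    try:
        for station in stations:
            station.process(op_id)
            passed.append(station)
    except RuntimeError:
        # Notify every station the car already passed, in reverse order.
        for station in reversed(passed):
            station.revert(op_id)
        return False
    return True
```

If every station succeeds, each one ends up with the status recorded. If any station fails, the stations already passed are reverted and no trace of the operation remains.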

It is 11 o'clock at night and I will not finish this article today, so I will stop here and summarize its contents:

　　1. To solve the storage bottleneck of a large website, we first have to be clear about where storage fits. A database combines storage and computation, but in our scenario storage comes first. When storage becomes the bottleneck, we should be ruthless and give up as many of the database's computing features as we can. That is why I argued above that a database should not overuse computing features such as triggers and stored procedures.

　　2. Stripping computation out of the database does not mean the data no longer needs computation, because data that cannot be computed on is worthless. The computation has to move from the database into the program. In large systems the programs and the databases are generally deployed on separate servers, so processing data in the program does not affect the performance of the database server, and the database server can concentrate on storage.

　　3. We should do everything we can to minimize the impact of database changes on the service layer. Ideally, existing business code does not change at all after the database is split. That calls for a new data access layer that decouples the database from the service layer. All database changes are absorbed by the data access layer, and its external interface should be highly uniform and not change lightly.

　　4. Once we design a data access layer to handle database splitting, the data access layer plus the databases effectively form a distributed database solution. This is why splitting a database is so difficult: the database takes on distributed characteristics, and distributed development is harder development.

　　5. When handling distributed transactions, start from the specific problem. Do not reach for a general-purpose distributed transaction technology just because an operation is, in essence, a distributed transaction. That line of thinking is really a way of dodging the difficulty, and it may end up making the problem more complicated.

That is all for today. Good night, and I wish you a happy life!
