Thoughts on the technology evolution of large websites (IV): Storage bottlenecks (4)


If the database needs to be split horizontally, that is actually a very happy thing, because it means the company's business is growing rapidly. For developers it means there is an endless stream of projects to work on. It will feel very busy, but life is full and the heart is at ease.

Put simply, a horizontal split of the database works like this: a table is first split out vertically from the original database into its own database, and then that single table is split further into several tables, each stored in a separate database. Once a table has been split horizontally, the original table becomes a logical concept, and the business meaning of this logical table has to be fulfilled by several physical tables working together. As a result, operations on this table go beyond what the database itself can provide, and we have to design a scheme of our own to make up for what the database lacks. This is the biggest technical difficulty of a horizontal split.

The horizontal split of the database is an upgraded version of the vertical split. The two resemble a parent-child relationship in an inheritance mechanism, so the problems that a vertical split runs into, namely join queries and distributed transactions, also appear after a horizontal split. Because the table is now physically dismantled, the logical table gains extra dimensions, and those extra dimensions make both problems more complex than they were in the vertical split. Apart from these two inherited challenges, a horizontal split also creates new technical challenges, as follows:

Problem one: once a table is split horizontally, designing its primary key becomes very difficult;

Problem two: the query logic that used to run against a single table will face challenges.

While preparing this article, I saw that some material also mentions further problems:

Problem three: after a horizontal split, foreign key design becomes very difficult;

Problem four: this challenge concerns inserting new data. Broadly, it asks what rule we should use to decide which physical table a given piece of data is stored in.

I already answered problem three in the previous article, but I will add a little here. The foreign key problem actually exists in a vertical split as well. I did not discuss it there mainly because I set a premise: at the very first data modeling stage, the tables discard all foreign key design, and the logic of foreign keys is handed to the service layer. We do this to reduce the operational pressure on the database as much as we can. Beyond that, we also want each table, as an atom of storage, to stay relatively independent and unrelated to the others, and the most direct way to achieve this is to remove the symbol of the relationship between tables: the foreign key. This lays a solid foundation for later vertical and horizontal splits of the database.

As for problem four, its essence is where a specific piece of data lands after the database has been split into multiple tables, and the key factor in where data lands is the primary key. Imagine designing a table in which every field is allowed to be empty. There is still one field that must never be empty, and that is the primary key. The primary key is the identity of the data in the database, so if the primary key design reflects the rule for where data lands, problem four is solved as well. I will therefore focus on the first two problems of the horizontal split.

First, the problem of primary key design in a horizontal split. Setting aside any business meaning a primary key may carry, the essence of a primary key is to express the uniqueness of a record in a table. When designing a database we can represent the primary key with one field that never repeats, or we can use several fields together to express this uniqueness. A single-field primary key is already atomic and cannot be broken down further, but a primary key made of several fields is problematic for a horizontal split, mainly when deciding which database the data lands in. I will elaborate on how the primary key affects where data lands after covering the related background. The point here is that when we encounter a composite primary key, we can add a field with no business meaning to serve as the key, though this depends on the scenario. I tend to combine the values of the fields in the composite key into a single field that represents the primary key. If some readers think this causes data redundancy, you can simply remove the original fields of the composite key and represent them all with the one merged field, using a separator when merging so that the service layer can split the value back out for business use.

Based on the above, here is my first principle of primary key design for a horizontal split: the primary key of a table that is split horizontally is best represented by a single field.
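The merge-with-a-separator idea can be sketched in a few lines. This is only an illustration: the `|` separator, the function names `merge_key` and `split_key`, and the sample values are all assumptions and do not come from any particular system.

```python
SEPARATOR = "|"  # assumed delimiter; it must never appear inside a key part


def merge_key(*parts: str) -> str:
    """Join the fields of a composite key into one primary key string."""
    for part in parts:
        if SEPARATOR in part:
            raise ValueError(f"key part {part!r} contains the separator")
    return SEPARATOR.join(parts)


def split_key(key: str) -> list[str]:
    """Recover the original fields in the service layer."""
    return key.split(SEPARATOR)


# Hypothetical composite key: order number, user number, line number.
merged = merge_key("ORD-0001", "U1001", "3")
print(merged)             # ORD-0001|U1001|3
print(split_key(merged))  # ['ORD-0001', 'U1001', '3']
```

Rejecting parts that already contain the separator keeps the split unambiguous, which matters once the merged value is the only copy of those fields.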

If the primary key only expresses the uniqueness of a record, the horizontal split is much simpler. For example, Oracle has a sequence mechanism, which is in effect an auto-increment algorithm, and almost all relational databases have some auto-increment mechanism. Auto-increment is a primary key design we often like to use. If the table we want to split uses an auto-increment field, and that field only expresses record uniqueness, then the horizontal split is much easier to handle. Here are two classic schemes:

Scheme one: auto-increment columns let us set a step size. If we are going to split one table into two physical tables, we can give the primary key of one table a step of 2 and a starting value of 1, so its keys run 1, 3, 5, 7, and so on. The other physical table also gets a step of 2 but a starting value of 2, so its keys run 2, 4, 6, 8, and so on. The primary keys of the two tables will never repeat, and we do not need any additional logic to coordinate the two physical tables. A further benefit of this scheme is that the step size is tied to the granularity of the horizontal split, so it is also the room we leave for expanding the split later. For example, if we set the step size to 9, the table can in theory be expanded to 9 physical tables.
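The step-size rule can be simulated outside the database to check that the sequences never collide. The following is a minimal sketch; the function name `shard_sequence` is hypothetical, and a real deployment would configure the start value and step in the database itself.

```python
from itertools import count, islice


def shard_sequence(shard_index: int, step: int):
    """Return an iterator of IDs for one physical table.

    The table with index i starts at i + 1 and advances by `step`.
    """
    if not 0 <= shard_index < step:
        raise ValueError("shard_index must be in range(step)")
    return count(start=shard_index + 1, step=step)


table_a = shard_sequence(0, step=2)
table_b = shard_sequence(1, step=2)
print(list(islice(table_a, 4)))  # [1, 3, 5, 7]
print(list(islice(table_b, 4)))  # [2, 4, 6, 8]
```

With a step of 9, the same function yields nine non-overlapping sequences, which is the expansion headroom described in the text.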

Scheme two: we cap the amount of data that each physical table may store, a cap that we estimate roughly in advance from business and technical rules. If we estimate that one table should hold 200 million rows, we can set the auto-increment rules accordingly: the first physical table starts at 1 with a step of 1 and a maximum value of 200 million, the second physical table starts at 200,000,001 with a step of 1 and its own maximum value, and so on.
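The range rule can be written as two small functions. This sketch assumes the first table owns IDs 1 through 200,000,000 inclusive, so that the next table begins just above that boundary; the names `id_range` and `table_for_id` are hypothetical.

```python
CAPACITY = 200_000_000  # estimated rows per physical table, as in the text


def id_range(table_index: int, capacity: int = CAPACITY) -> tuple[int, int]:
    """Return the inclusive (first_id, last_id) range owned by a physical table."""
    first = table_index * capacity + 1
    return first, first + capacity - 1


def table_for_id(record_id: int, capacity: int = CAPACITY) -> int:
    """Return the index of the physical table that owns record_id."""
    if record_id < 1:
        raise ValueError("IDs start at 1")
    return (record_id - 1) // capacity


print(id_range(0))                # (1, 200000000)
print(id_range(1))                # (200000001, 400000000)
print(table_for_id(200_000_000))  # 0
print(table_for_id(200_000_001))  # 1
```

Because the owning table follows directly from the ID, a lookup by primary key needs no routing table.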

What if the table's primary key is not an auto-increment column but a unique field designed by the business? How do we handle the distribution of primary keys then? This is a typical scenario. For example, a trading site will certainly have an order table and a transaction log table. The order table has an order number and the transaction log table has a serial number. These numbers are defined according to business rules that guarantee their uniqueness, so the auto-increment schemes above cannot solve the primary key problem of their horizontal split. How do we deal with this situation? If we think carefully about the horizontal split of a database, it is actually similar to a distributed cache: the primary key of the database is equivalent to the key in a distributed cache, so we can follow the schemes used by distributed caches to design the primary key model. The schemes are as follows:

Scheme one: use an integer hash algorithm. Hashing a string produces a value that serves as a fingerprint of that string. With a good hash function, even a slight change to the string almost always produces a different hash value. Collisions are possible in principle, but in practice two different strings rarely share a hash value. Hash functions such as MD5 and SHA work on this principle: they compute a fingerprint of the data, and comparing fingerprints shows whether the data has been tampered with. Most such algorithms, however, produce a string of letters and digits, whereas here I use an integer hash algorithm so that the computed hash value is an integer. Next we count the number of servers used for the horizontal split. If there are 3 servers, we take the integer hash modulo the number of servers and use the remainder to select the server.

[Figure: schematic diagram of the algorithm]
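The hash-and-remainder routing can be sketched as follows. The text does not name a specific integer hash, so CRC32 from Python's standard `zlib` module is used here purely as an assumed stand-in, and the order numbers are made up for illustration.

```python
import zlib

SERVER_COUNT = 3  # number of physical databases, as in the text


def integer_hash(key: str) -> int:
    """Map a string key to a non-negative integer (CRC32 is an assumed choice)."""
    return zlib.crc32(key.encode("utf-8"))


def server_for(key: str, server_count: int = SERVER_COUNT) -> int:
    """Select a server index from the remainder of the integer hash."""
    return integer_hash(key) % server_count


# Hypothetical order numbers.
for order_no in ("ORD-0001", "ORD-0002", "ORD-0003"):
    print(order_no, "->", server_for(order_no))
```

Python's built-in `hash()` is deliberately avoided: for strings it is randomized per process by default, so the same key could be routed to different servers after a restart.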

Scheme two: the upgraded version of scheme one is consistent hashing. Consistent hashing shows its greatest value when we expand the number of physical tables or when a server in the physical table cluster fails. I will cover it in a later discussion of physical database expansion, so I will not go into it here.
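For readers who want a concrete picture in the meantime, here is a minimal sketch of a consistent hash ring. It is not the author's design: the class name `ConsistentHashRing`, the use of SHA-256, the 100 virtual nodes per server, and the server names are all assumptions made for illustration.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Place a string on the ring using the first 8 bytes of its SHA-256 digest."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ConsistentHashRing:
    def __init__(self, servers, replicas: int = 100):
        self.replicas = replicas  # virtual nodes per server (assumed value)
        self._ring = []           # sorted list of (position, server)
        for server in servers:
            self.add(server)

    def add(self, server: str) -> None:
        """Insert the server's virtual nodes, keeping the ring sorted."""
        for i in range(self.replicas):
            bisect.insort(self._ring, (ring_position(f"{server}#{i}"), server))

    def remove(self, server: str) -> None:
        """Drop every virtual node that belongs to the server."""
        self._ring = [entry for entry in self._ring if entry[1] != server]

    def server_for(self, key: str) -> str:
        """Return the first server clockwise from the key, wrapping at the end."""
        index = bisect.bisect_left(self._ring, (ring_position(key), ""))
        return self._ring[index % len(self._ring)][1]


ring = ConsistentHashRing(["db0", "db1", "db2"])
keys = [f"ORD-{n}" for n in range(10_000)]
before = {key: ring.server_for(key) for key in keys}
ring.add("db3")
moved = [key for key in keys if ring.server_for(key) != before[key]]
print(f"{len(moved) / len(keys):.0%} of keys moved after adding db3")
```

In this sketch, every key that changes location moves to the newly added server and the rest stay where they were, which is the property that makes expansion and failure handling cheaper than plain modulo routing.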

From the above we can see that when a database is split horizontally, the routing algorithm is built on the uniqueness of the primary key, and which physical database the data finally lands in is determined by how the primary key is designed. Returning to my earlier point: if the table in the original database uses a composite primary key, we first have to merge the composite key fields and then apply the algorithm above to decide where the data lands. Leaving the fields unmerged may not look too cumbersome, but in my years of development experience, splitting uniqueness across multiple fields amounts to adding dimensions to the primary key. The more fields, the more dimensions, and in every business calculation we have to keep all these dimensions in mind, which is very error-prone. I personally think that once a database has reached the stage of horizontal splitting, the importance of its storage role has risen greatly. To keep the database's storage role pure and clean, we should avoid adding complexity to the database design wherever we can, for example by removing foreign keys and by merging the composite key fields into one field as described here. To reduce this difficulty, even some necessary redundancy is worth it.

There is a more direct solution to the problem of primary key uniqueness after a horizontal split, and it is the one many people naturally think of when they meet this problem: move primary key generation into a dedicated key generation system on a single server, and have every new record obtain its primary key from that server. The generation algorithm itself is very simple. Many languages can compute a UUID, a globally unique identifier that in some versions is derived from hardware information of the server on which it is generated. I did not put this scheme first because it has many more flaws than the schemes above, and I will list them below:
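As a small illustration of the UUID computation mentioned above, Python's standard `uuid` module exposes several versions. The choice of Python here is an assumption; the text does not say which language or UUID version the key generation server would use.

```python
import uuid

# Version 1: built from a timestamp plus a node ID (normally the MAC address).
time_based = uuid.uuid1()

# Version 4: built from random bits, with no dependence on host hardware.
random_based = uuid.uuid4()

print(time_based, "version", time_based.version)
print(random_based, "version", random_based.version)
print(len(str(random_based)), "characters in canonical text form")
```

Version 1 is the variant that matches the description of an identifier based on the hardware information of the generating server.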

Disadvantage one: if primary key generation is moved to an external server, primary key values have to be transmitted over the network, and the network is the least efficient link in a computer system. This slows down inserts, and when the data volume is large and inserts are very frequent, the disadvantage is greatly magnified;

Disadvantage two: because key generation, for example with a UUID algorithm, depends on a single server, the primary key generator becomes the weak point of the whole horizontally split database cluster, and a critical one. If the key generation server fails, the entire system becomes unusable. A table that needs a horizontal split is a business table, so its importance in the system is naturally high, and introducing a single point of failure after splitting it is fatal for the whole system. Some people will surely say: since there is a single point of failure, make the generator a cluster and the problem is solved. That idea can indeed solve the problem I just described, but as I have said before, we should stick to one principle when developing real software systems: if a simple solution solves the problem, choose the simple solution. Introducing a cluster means introducing a distributed system, which increases both development difficulty and operational risk. If the schemes above already solve our problem, why ask for such a complicated one?

Disadvantage three: generating primary keys in an external system adds state to our horizontal split design, whereas the schemes I described above are stateless. Stateful systems interact with one another, so when the volume of data operations grows, contention for resources on the primary key system is inevitable. If we do not handle that contention well, the primary key system may well deadlock, which produces the 503 errors I mentioned in the previous article. A stateless system has no resource contention or deadlock problem, which improves the robustness of the system, and another advantage of a stateless system is that it is easy to scale horizontally.

Having listed the disadvantages of a single primary key generation system, I do not think the solution is entirely without merit. It depends on the specific business scenario. In my own experience, though, I have not yet found a scenario that suits a single primary key generator.

Another feature of the schemes I propose is that they distribute data evenly across the physical tables. Even distribution keeps the load on the different physical tables balanced, so there are no hotspots in the system and no server sits idle with less work than the others. Evenly allocated resources are used effectively, which lowers production cost and raises production efficiency. However, an even distribution of data often brings a lot of trouble for business operations.
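The even distribution claim is easy to check empirically for hash-based routing. The sketch below counts how 30,000 made-up order numbers spread across 3 physical tables. The key format, the function name `shard_for`, and the use of SHA-256 as the hash are assumptions for illustration only.

```python
import hashlib
from collections import Counter


def shard_for(key: str, shard_count: int) -> int:
    """Hash the key to an integer and take the remainder."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical order numbers; real keys would come from the business system.
counts = Counter(shard_for(f"ORD-{n:06d}", 3) for n in range(30_000))
for shard in sorted(counts):
    print(f"table {shard}: {counts[shard]} rows")
```

Each table should receive close to one third of the rows, so no single table becomes a hotspot.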

After splitting the database horizontally, we also have to consider horizontal scaling. For example, suppose we used 3 servers for the initial horizontal split, and at a certain stage the table hits a storage bottleneck again, so we have to expand the database. If the horizontal split scheme was poorly designed at the start, the expansion will run into a lot of trouble.
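One way to see the kind of trouble expansion can cause is to measure how many keys change servers when plain modulo routing grows from 3 servers to 4. This is only a sketch with integer keys, and the function name `moved_fraction` is hypothetical.

```python
def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of integer keys whose modulo-selected server changes."""
    keys = list(keys)
    moved = sum(1 for key in keys if key % old_count != key % new_count)
    return moved / len(keys)


# Growing from 3 to 4 servers with plain modulo routing.
print(moved_fraction(range(12_000), 3, 4))  # 0.75
```

Three quarters of the rows would have to be migrated in this example, because a key stays put only when its remainders modulo 3 and modulo 4 happen to agree.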

I will discuss these questions in my next article. That is all for today. I wish you a happy life.
