Thoughts on the evolution of large-scale website technology (IV.)--Storage bottlenecks (4)

Source: Internet
Author: User

If a database needs to be split horizontally, that is actually a happy occasion, because it means the company's business is growing rapidly. For developers it means a project with no end of work to do. It will feel very busy, but a busy life is a full one, and it keeps the mind at ease.

Put simply, horizontal splitting takes a table that a vertical split has already moved into its own database and divides it further into multiple tables, each stored in a separate database. Once a table is split horizontally, the original table becomes a logical concept, and the business meaning of this logical table depends on many physical tables working together. Operations on the table then go beyond what the database itself can provide. In other words, the database alone can no longer handle our operations on the table, and we have to write programs that compensate for what the database lacks. This is the biggest technical difficulty of horizontal splitting.

The horizontal split is an upgraded version of the vertical split, much like a child class that inherits from a parent, so the join query problem and the distributed transaction problem that the vertical split ran into are still present. Because physically breaking a table apart adds a dimension to the logical table, it also adds a dimension to those two problems, so joins and distributed transactions become more complex under a horizontal split. Besides the two challenges inherited from the vertical split, horizontal splitting creates new technical challenges of its own, which are as follows:

Problem one: When the table of a database is split horizontally, the primary key design of the table becomes very difficult;

Problem two: The original single-table query logic will face challenges.

While preparing this article, I found that some sources mention further challenges:

Problem three: After a table is split horizontally, foreign key design becomes very difficult;

Problem four: This one concerns inserting new data. Roughly, it asks what rule decides which of the split physical tables a new row should be stored in.

I answered problem three in the previous article, and here I add a supplement. The foreign key problem already exists in the vertical split, but we did not discuss it there because I set a premise: in the earliest data modeling phase the tables abandon all foreign keys, and the foreign key logic is handled in the service layer. We want to do everything we can to reduce the computational load on the database. Beyond that, we also want the tables, as the atoms of storage, to be relatively independent and unrelated to each other, and the most direct way to achieve this is to remove the thing that links one table to another: the foreign key. This gives us a solid foundation for future vertical and horizontal splits.

As for problem four, its essence is the question of where a specific row lands once the database has been split, and the key factor in where a row is stored is the primary key. Imagine designing a table in which every field is allowed to be empty. One field can never be empty, and that is the primary key, because the primary key is the identity of the row in the database. So if the primary key design also expresses the placement rule for the data, problem four is solved. I will therefore focus on the first two problems of horizontal splitting.

The first is the primary key design problem. Setting aside any business meaning a primary key may carry, the essence of a primary key is to express the uniqueness of a record in a table. We can use a single, absolutely non-repeating field as the primary key, or use several fields together to express the uniqueness of a record. A single-field primary key is already atomic and cannot be broken down further, but a multi-field primary key causes trouble in a horizontal split, mainly in deciding which database a row lands in. I will explain the influence of the primary key on data placement later. Here I only note that a composite primary key can be replaced by a surrogate field with no business meaning, although that depends on the scenario. I prefer to combine the values of the composite key's fields into one field that serves as the primary key. If some readers feel this causes data redundancy, the original fields can simply be dropped so that only the merged field remains, but the merged value should use a delimiter so that the service layer can split it back into its business parts.

From the above, I give the first principle of primary key design for horizontal splitting: the primary key of a horizontally split table is best represented by a single field.
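The idea of merging a composite key into one delimited field can be sketched as follows. This is a minimal illustration, not the author's implementation: the `|` delimiter, the function names `merge_key` and `split_key`, and the sample field values are all assumptions made for the example.

```python
DELIMITER = "|"  # assumed separator; it must never occur inside a field value


def merge_key(*parts: str) -> str:
    """Combine the fields of a composite key into one primary-key string."""
    for part in parts:
        if DELIMITER in part:
            raise ValueError(f"field {part!r} contains the delimiter")
    return DELIMITER.join(parts)


def split_key(key: str) -> list[str]:
    """Recover the original fields in the service layer."""
    return key.split(DELIMITER)


if __name__ == "__main__":
    key = merge_key("20240101", "shop42", "000137")  # hypothetical field values
    print(key)             # 20240101|shop42|000137
    print(split_key(key))  # ['20240101', 'shop42', '000137']
```

Rejecting values that contain the delimiter keeps the split unambiguous; an escaping scheme would also work but is more to maintain.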

If the primary key does nothing more than express the uniqueness of a record, the horizontal split is relatively simple. Oracle, for example, has the sequence mechanism, which is really an auto-increment algorithm. Almost every relational database has some auto-increment mechanism, and it is the primary key design most of us like to use. If the table we want to split uses an auto-increment field, and that field is only used to express record uniqueness, the split is much easier to handle. I give two classic scenarios:

Scenario One : An auto-increment column lets us set the step. If we split a table into two physical tables, we can give the auto-increment primary key of the first table a step of 2 and a starting value of 1, so it produces 1, 3, 5, 7, and so on. We give the other physical table a step of 2 as well, but a starting value of 2, so it produces 2, 4, 6, 8, and so on. The primary keys of the two tables will never collide, and we do not need any extra logic to coordinate the two physical tables. A further benefit of this scheme is that the step size determines both the granularity of the horizontal split and how far we can expand it later. For example, if we set the step to 9, the table can in theory be expanded to 9 physical tables.
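The interleaved sequences in Scenario One can be simulated in a few lines. This is only a sketch of the arithmetic: in practice the database generates the values (for example through a sequence's start value and increment), and the names `shard_id_sequence` and `table_for_id` are made up for this illustration.

```python
from itertools import count, islice


def shard_id_sequence(shard_index: int, shard_count: int):
    """Return an iterator over the auto-increment values of one physical table.

    shard_index is 1-based: table 1 starts at 1, table 2 starts at 2, and so on.
    The step equals the planned number of physical tables.
    """
    if not 1 <= shard_index <= shard_count:
        raise ValueError("shard_index must be between 1 and shard_count")
    return count(start=shard_index, step=shard_count)


def table_for_id(record_id: int, shard_count: int) -> int:
    """Recover the 1-based table number that issued a given ID."""
    return (record_id - 1) % shard_count + 1


table_1 = list(islice(shard_id_sequence(1, 2), 4))
table_2 = list(islice(shard_id_sequence(2, 2), 4))

if __name__ == "__main__":
    print(table_1)             # [1, 3, 5, 7]
    print(table_2)             # [2, 4, 6, 8]
    print(table_for_id(7, 2))  # 1
```

Because the start value identifies the table, the ID alone is enough to locate a row, which is why no extra mapping between the physical tables is needed.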

Scenario Two : We can estimate in advance, from business and technical rules, the maximum amount of data each physical table is allowed to store. If we estimate that a table should hold at most 200 million rows, we can set the auto-increment rules accordingly: the first physical table's column starts at 1 with a step of 1, the second physical table's column starts at 200,000,001 (just past the first table's limit of 200 million), also with a step of 1, each column is capped at its table's maximum value, and so on for further tables.
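The range rule in Scenario Two amounts to simple integer arithmetic. The sketch below assumes that table 1 owns IDs 1 through 200,000,000 inclusive and that each later table owns the next block of the same size; `ROWS_PER_TABLE`, `id_range`, and `table_for_id` are illustrative names, not part of any database API.

```python
ROWS_PER_TABLE = 200_000_000  # estimated capacity of one physical table


def id_range(table_number: int) -> tuple[int, int]:
    """Return the inclusive (first_id, last_id) range of a 1-based table number."""
    if table_number < 1:
        raise ValueError("table_number starts at 1")
    first = (table_number - 1) * ROWS_PER_TABLE + 1
    last = table_number * ROWS_PER_TABLE
    return first, last


def table_for_id(record_id: int) -> int:
    """Return the 1-based number of the physical table that owns record_id."""
    if record_id < 1:
        raise ValueError("record_id starts at 1")
    return (record_id - 1) // ROWS_PER_TABLE + 1


if __name__ == "__main__":
    print(id_range(1))                # (1, 200000000)
    print(id_range(2))                # (200000001, 400000000)
    print(table_for_id(200_000_001))  # 2
```

Unlike the step scheme, adding another table here never requires renumbering: a new table simply takes the next free range.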

So how do we distribute primary keys if the table's primary key is not an auto-increment column but a unique field defined by the business? This case is typical. A trading site, for example, will have an order table and a transaction record table. The order table has an order number and the transaction record table has a serial number, and these numbers are generated by business rules that guarantee their uniqueness. The auto-increment solutions above cannot solve the primary key problem for these tables, so how do we deal with them? If we think carefully about horizontal splitting, it is actually similar to a distributed cache: the primary key of the database corresponds to the key in the cache. We can therefore design the primary key model along the lines of distributed cache schemes, as follows:

Scheme one : Use an integer hash and take the remainder. Hashing a string produces a value that acts as a fingerprint of that string: change the string even slightly and the hash almost certainly changes. With a good hash function two different strings very rarely share a hash value, although collisions are possible in principle. The MD5 and SHA algorithms used in cryptography work this way, and comparing hash values is a common way to check whether data has been tampered with. The output of most hash algorithms, however, is usually written as a string of letters and digits, so here I use an integer hash algorithm whose result is an integer. Next we need the number of servers used for the horizontal split. If there are 3 servers, we divide the integer hash by 3 and use the remainder (0, 1, or 2) to select the server.
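The hash-and-remainder routing of scheme one can be sketched as below. The article does not name a specific integer hash, so CRC-32 from Python's standard `zlib` module is used here purely as an assumption; the function names and the sample order numbers are also hypothetical. Python's built-in `hash()` is deliberately avoided because its result for strings changes between processes by default.

```python
import zlib

SERVER_COUNT = 3  # number of servers used for the horizontal split


def integer_hash(key: str) -> int:
    """Map a string key to a non-negative integer (CRC-32, chosen for illustration)."""
    return zlib.crc32(key.encode("utf-8"))


def server_for_key(key: str, server_count: int = SERVER_COUNT) -> int:
    """Return the 0-based index of the server that stores the row with this key."""
    return integer_hash(key) % server_count


if __name__ == "__main__":
    for order_no in ("ORD-20240101-0001", "ORD-20240101-0002", "ORD-20240101-0003"):
        print(order_no, "-> server", server_for_key(order_no))
```

The same key always maps to the same server, so reads can locate a row without any lookup table. The weakness is that changing `SERVER_COUNT` remaps most keys.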

Scheme two : Consistent hashing, an upgrade of scheme one. The main value of consistent hashing shows when we expand the number of physical tables or when a server in the cluster fails. I will cover this in detail when I discuss expanding the physical databases, so I will not go into it here.

From the above we can see that when a database is split horizontally, the algorithms we choose rely on the uniqueness of the primary key, and the physical database a row finally lands in is also determined by how the primary key is designed. To return to what I mentioned earlier: if the original table uses a composite primary key, we must first merge the composite key's fields and then apply the algorithms above to determine where the data is placed. Leaving the fields unmerged may not look too cumbersome, but in my years of development experience, spreading uniqueness across several fields adds dimensions to the primary key. The more fields, the more dimensions, and in every piece of business logic we have to keep those dimensions in mind, which makes mistakes very easy. Personally, I think that once a database has reached the stage of horizontal splitting, the importance of storage has risen greatly. To keep the storage role of the database pure and clean, we should avoid adding complexity to the database design wherever we can. Removing foreign keys is one example, and merging a composite key into a single field is another. Both reduce difficulty, and even some redundancy is a price worth paying.

There is a more direct way to guarantee primary key uniqueness after a horizontal split, and it is the one many people think of naturally: turn the key generation rule into a primary key generation system, run it on a single server, and have every insert fetch its new primary key from that server. The generation algorithm itself can be very simple. Many languages have a function for computing a UUID, an identifier designed to be globally unique that, in some versions, is computed from hardware information of the machine. I did not present this scheme first because, compared with the schemes I described above, it has too many shortcomings. I list them below:

Disadvantage one : Primary key generation sits on an external server, so every primary key value has to be fetched over the network, and network communication is the least efficient link in a computer system. This slows down inserts, and the shortcoming is magnified greatly when the data volume is large and inserts are very frequent;

Disadvantage two : If we generate primary keys with the UUID algorithm on a single server, that primary key generator becomes the weak point of the entire cluster of horizontally split physical databases, and a critical one: if the generator's server fails, the whole system becomes unusable. A table that needs to be split horizontally is a business table, so its importance to the overall system is naturally very high. If the horizontal split introduces a single point of failure for it, that is fatal to the entire system. Some people will certainly say that since there is a single point of failure, we can run the generator as a cluster, and the problem is solved. That idea really can solve the problem I described, but as I have said before, in real software development we should stick to one principle: when a simple solution exists, choose the simple solution. Introducing a cluster means introducing a distributed system, which raises both development difficulty and operational risk. If the schemes above already solve our problem, why adopt such a complicated plan?

Disadvantage three : Using an external system to generate primary keys makes our horizontal split scheme stateful, whereas the schemes I described above are stateless. Stateful systems affect one another. With an external key generator, for example, a rise in data operations inevitably causes contention for resources in the primary key system. If we handle that contention badly, the primary key system may deadlock, which produces the 503 errors I mentioned earlier. A stateless system has no resource contention or deadlock problems, which makes the system more robust. Another advantage of stateless systems is that they are easy to scale horizontally.

I list these disadvantages not to say that a standalone primary key generation system is never acceptable; that depends on the specific business scenario. In my own experience, though, I have not yet found a scenario that suits a separate primary key generator.

The schemes I put forward above share another feature: they distribute data evenly across the physical tables. Even distribution keeps the load on the physical tables balanced, so the system has no hotspots and no server sits idle doing less work than the others. Evenly distributed resources are used effectively, which lowers production cost and raises efficiency. However, evenly distributed data often causes a lot of trouble for business operations.

After a horizontal split we also have to think about horizontal expansion. Suppose we start with 3 servers for the split, and after the system has run for a while the table hits a storage bottleneck again and we have to scale the database out. If the horizontal split was not designed well from the start, the expansion will run into a great deal of trouble.

I will discuss these questions in my next article. That is all for today; I wish you all a happy life.