Basic idea and segmentation strategy of Database sharding (RPM)

Last Update:2018-06-08 Source: Internet

Author: User

Tags database sharding


First, the basic idea

The basic idea of sharding is to split one database into multiple parts and place them on different databases (servers), which relieves the performance problems of a single database. Loosely speaking, if a database is large because it has many tables, each holding a lot of data, vertical segmentation is appropriate: closely related tables (for example, those of the same module) are split off onto one server. If there are not many tables but each table holds a lot of data, horizontal segmentation is appropriate: the rows of a table are distributed across multiple databases (servers) according to a rule (for example, a hash of the ID). In practice the two situations are usually mixed, so you have to choose according to the actual situation, and you may combine vertical and horizontal segmentation, turning the original database into a matrix-like array of databases (servers) that can keep expanding. Vertical and horizontal segmentation are described in detail below.

The most important feature of vertical segmentation is that the rules are simple and it is easy to implement, especially when the coupling between the different business modules is low.

Horizontal segmentation is somewhat more complex than vertical segmentation. Because different rows of the same table must be placed in different databases, the splitting rule is more complex for the application than splitting by table name, and later data maintenance is more complex as well.
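The difference in routing complexity between the two kinds of segmentation can be sketched in a few lines. This is a minimal illustration, not a real router: the table names, database names, and shard count are hypothetical, and "ID modulo shard count" is only one possible form of the "ID hash" rule mentioned above.

```python
# Hypothetical table-to-database map for vertical segmentation:
# each table lives wholly in one database, so the table name is enough to route.
VERTICAL_MAP = {
    "user": "user_db",
    "user_profile": "user_db",
    "forum": "forum_db",
    "post": "forum_db",
}

# Hypothetical shard list for horizontal segmentation of one large table.
ORDER_SHARDS = ["order_db_0", "order_db_1", "order_db_2", "order_db_3"]


def route_vertical(table: str) -> str:
    """Vertical segmentation: the table name alone decides the database."""
    return VERTICAL_MAP[table]


def route_horizontal(row_id: int) -> str:
    """Horizontal segmentation: the row's ID decides the database."""
    return ORDER_SHARDS[row_id % len(ORDER_SHARDS)]


print(route_vertical("post"))   # forum_db
print(route_horizontal(10))     # order_db_2
```

With vertical segmentation the application only needs to know which table it is querying; with horizontal segmentation every query must also carry the value the rule is computed from.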

Let's consider data segmentation in a typical situation. On the one hand, the tables in a database usually cannot all be linked together through a single table, which implies that horizontal segmentation almost always targets a small group of closely related tables (in effect, one block produced by vertical segmentation) rather than all tables. On the other hand, in some very high-load systems even a single table cannot be served by a single database host, which means that vertical segmentation alone does not completely solve the problem. As a result, most systems combine the two: they segment the system vertically first, then selectively segment each small group of tables horizontally. The entire database is thus cut into a distributed matrix.

Second, the segmentation strategy

As mentioned earlier, segmentation is done vertically first and then horizontally; the result of vertical segmentation lays the groundwork for horizontal segmentation. The idea of vertical segmentation is to analyze the aggregation relationships between tables and put closely related tables together. In most cases these are the tables of the same module, or of the same "aggregate" in the sense used by domain-driven design. Within each vertically segmented group of tables, find the "root element" (the "aggregate root" of domain-driven design) and segment horizontally by it: starting from the root element, place all data directly or indirectly associated with it into one shard. Cross-shard associations then become very unlikely, and the application does not have to break existing associations between tables.

For example, on a social networking site almost all data is ultimately associated with a user, so segmenting by user is the best choice. In a forum system, the user and forum modules should be placed in two separate shards during vertical segmentation. Within the forum module, the forum is clearly the aggregate root, so horizontal segmentation is done by forum, and it is natural to put all posts and replies of a forum in the same shard as that forum.
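The forum example can be sketched as follows. The class names, fields, and shard count are hypothetical. Carrying the root's ID (`forum_id`) on every dependent row is an assumption made here so that each row can be routed without a lookup; the article does not prescribe how the root ID is obtained.

```python
from dataclasses import dataclass

SHARD_COUNT = 4  # assumed number of horizontal shards for the forum module


@dataclass
class Post:
    post_id: int
    forum_id: int  # ID of the aggregate root this post belongs to


@dataclass
class Reply:
    reply_id: int
    post_id: int
    forum_id: int  # root ID repeated on the reply so it can be routed directly


def shard_for_forum(forum_id: int) -> int:
    """All data under one forum maps to the shard chosen by the forum's ID."""
    return forum_id % SHARD_COUNT


def shard_for(entity) -> int:
    """Route any member of the aggregate by its root ID, never by its own ID."""
    return shard_for_forum(entity.forum_id)


post = Post(post_id=1001, forum_id=7)
reply = Reply(reply_id=50001, post_id=1001, forum_id=7)

# The forum, its post, and the reply all land in the same shard,
# so joins between them never cross shards.
print(shard_for_forum(7), shard_for(post), shard_for(reply))  # 3 3 3
```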

For shared data, if it is a read-only dictionary table, maintaining a copy in each shard is a good choice, because associations with it do not have to be broken. If it is ordinary data associated across nodes, the association must be broken.

Note that the segmentation strategy changes subtly when vertical and horizontal segmentation are applied together. When only vertical segmentation is considered, the tables grouped together can keep arbitrary associations among themselves, so tables can be divided by "function module". Once horizontal segmentation is introduced, the associations between tables are greatly constrained: typically only one primary table (the table whose ID is hashed) and its secondary tables can keep their associations. In other words, when both are applied, the vertical cut is no longer made by "function module" but at a finer granularity, one that coincides with, or is even identical to, the "aggregate" of domain-driven design. The primary table of each shard is the aggregate root of an aggregate.

Segmented this way, the database may become too fragmented (there are more shards, but few tables in each). To avoid managing too many data sources and to make full use of each database server, consider placing two or more shards that are close in business terms and have similar data growth rates (primary table data volumes of the same order of magnitude) in the same data source. Each shard remains independent: it has its own primary table and is hashed by its own primary table ID. The only requirement is that the hash modulus (that is, the number of nodes) be the same for all of them.

(This article focuses on the basic ideas of sharding and the theory of the segmentation strategy. For more detailed implementation strategies and reference examples, see my other blog post: Database sub-list (sharding) series (a) split implementation strategy and sample demonstration)
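The rule for placing several shards in the same data source can be illustrated with a small sketch. The shard names, data source names, and node count below are hypothetical; the point shown is only that each shard hashes its own primary table ID while all of them share one modulus.

```python
from typing import Tuple

NODE_COUNT = 3  # shared hash modulus, i.e. the number of data sources (assumed)
DATA_SOURCES = [f"ds_{i}" for i in range(NODE_COUNT)]

# Hypothetical shards, each keyed by its own primary table.
SHARD_PRIMARY_TABLE = {"order": "order", "shipment": "shipment"}


def locate(shard: str, primary_id: int) -> Tuple[str, str]:
    """Return (data source, primary table) for a primary-table ID of a shard.

    Every shard hashes its own primary-table ID, but all use the same
    modulus, so all shards are spread over the same set of data sources.
    """
    return DATA_SOURCES[primary_id % NODE_COUNT], SHARD_PRIMARY_TABLE[shard]


print(locate("order", 10))    # ('ds_1', 'order')
print(locate("shipment", 5))  # ('ds_2', 'shipment')
```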

1. Transaction issues
There are two possible approaches to the transaction problem: distributed transactions, and transactions controlled jointly by the application and the database. A brief comparison of the two follows.
Option 1: Use distributed transactions
Advantages: managed by the database; simple and effective
Disadvantages: high performance cost, especially as the number of shards grows
Option 2: Joint control by the application and the database
Principle: split a distributed transaction that spans multiple databases into multiple small transactions, each on a single database, and let the application control each small transaction.
Advantages: better performance
Disadvantages: the application must control transactions flexibly. If you use Spring's transaction management, this change will be somewhat difficult.
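A minimal sketch of the second option, using in-memory SQLite databases from Python's standard library as stand-ins for two shards. The `account` table and the helper names are hypothetical. Note the limitation this approach accepts: the commits are issued one shard at a time, so a failure during the commit loop can still leave the shards inconsistent, which is exactly the guarantee a real distributed transaction would add.

```python
import sqlite3


def make_shard() -> sqlite3.Connection:
    """Create an in-memory database standing in for one shard."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account ("
        "id INTEGER PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.commit()
    return conn


def run_on_shards(work) -> bool:
    """Run one small local transaction per shard; commit only if all succeed.

    `work` is a list of (connection, sql, params) tuples. Best effort only:
    this is not atomic across shards.
    """
    conns = [conn for conn, _, _ in work]
    try:
        for conn, sql, params in work:
            conn.execute(sql, params)  # sqlite3 opens a transaction for DML
    except sqlite3.Error:
        for conn in conns:
            conn.rollback()
        return False
    for conn in conns:
        conn.commit()
    return True


shard_a, shard_b = make_shard(), make_shard()
shard_a.execute("INSERT INTO account VALUES (1, 100)")
shard_a.commit()
shard_b.execute("INSERT INTO account VALUES (2, 0)")
shard_b.commit()

# Transfer 30 from account 1 (shard A) to account 2 (shard B).
ok = run_on_shards([
    (shard_a, "UPDATE account SET balance = balance - ? WHERE id = ?", (30, 1)),
    (shard_b, "UPDATE account SET balance = balance + ? WHERE id = ?", (30, 2)),
])

# An overdraft violates the CHECK constraint, so both shards are rolled back.
failed = run_on_shards([
    (shard_b, "UPDATE account SET balance = balance + ? WHERE id = ?", (500, 2)),
    (shard_a, "UPDATE account SET balance = balance - ? WHERE id = ?", (500, 1)),
])

print(ok, failed)  # True False
```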
2. Cross-node join issues
Cross-node joins are unavoidable once data is segmented, but good design and segmentation can reduce how often they occur. The common solution is to run the query in two steps: find the IDs of the associated data in the result set of the first query, then issue a second request to fetch the associated data by those IDs.
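The two-step query can be sketched as follows, again with in-memory SQLite databases standing in for two nodes. The `orders` and `users` tables and their contents are hypothetical illustration data.

```python
import sqlite3

# Two in-memory databases stand in for two nodes (hypothetical schema).
order_node = sqlite3.connect(":memory:")
order_node.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount INTEGER)"
)
order_node.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 250), (2, 11, 90), (3, 10, 40)],
)

user_node = sqlite3.connect(":memory:")
user_node.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
user_node.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(10, "alice"), (11, "bob"), (12, "carol")],
)


def orders_with_user_names(min_amount: int):
    # Step 1: query the first node and collect the IDs of the associated rows.
    orders = order_node.execute(
        "SELECT id, user_id, amount FROM orders WHERE amount >= ? ORDER BY id",
        (min_amount,),
    ).fetchall()
    user_ids = sorted({user_id for _, user_id, _ in orders})
    if not user_ids:
        return []
    # Step 2: fetch the associated rows from the other node by those IDs.
    placeholders = ",".join("?" * len(user_ids))
    names = dict(
        user_node.execute(
            f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids
        ).fetchall()
    )
    # Stitch the two result sets together in the application.
    return [(order_id, names[user_id], amount) for order_id, user_id, amount in orders]


print(orders_with_user_names(50))  # [(1, 'alice', 250), (2, 'bob', 90)]
```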

3. Cross-node count, order by, group by, and aggregate function issues
These are one class of problem, because all of them must be computed over the complete data set. Most proxies do not merge the results automatically. Solution: as with cross-node joins, obtain the results from each node separately and merge them on the application side. Unlike a join, the per-node queries can run in parallel, so this is often much faster than querying a single large table. However, if the result set is large, application memory consumption becomes a problem.
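Application-side merging might look like the sketch below. Plain Python lists stand in for the rows held by each node, and the `(city, amount)` data is made up for illustration. Each node returns a partial result (as it would for a local COUNT, ORDER BY, or GROUP BY query), and the application combines the partial results.

```python
import heapq
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Rows (city, amount) held by each node; in-memory stand-ins for real shards.
SHARDS = [
    [("beijing", 30), ("shanghai", 10)],
    [("beijing", 5), ("hangzhou", 20)],
    [("shanghai", 25)],
]


def query_shard(rows):
    """Per-node partial result: row count, rows sorted by amount, sum per city."""
    totals = Counter()
    for city, amount in rows:
        totals[city] += amount
    return len(rows), sorted(rows, key=lambda r: r[1]), totals


# The per-node queries are independent, so they can run in parallel.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(query_shard, SHARDS))

# COUNT(*): add up the per-node counts.
total_count = sum(count for count, _, _ in partials)

# ORDER BY amount: merge the already-sorted per-node lists.
ordered = list(
    heapq.merge(*(rows for _, rows, _ in partials), key=lambda r: r[1])
)

# GROUP BY city with SUM(amount): add up the per-node group totals.
grouped = sum((totals for _, _, totals in partials), Counter())

print(total_count)   # 5
print(ordered[:3])   # [('beijing', 5), ('shanghai', 10), ('hangzhou', 20)]
print(dict(grouped))
```

Counts and sums merge by simple addition. An average cannot be merged by averaging the per-node averages; each node has to return its sum and count so the application can divide the totals.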

The granularity of vertical segmentation refers to how many levels of associated tables are allowed in one shard when segmenting vertically. This choice has a great impact on both the application and the sharding implementation.

The more associations are broken, the more join operations are affected and the greater the compromise the application must make; but routing for each table becomes simpler and less tied to the business, so it is easier to handle with a uniform mechanism. The extreme in this direction is to break all associations, so that every table has its own routing rule and routing can be handled automatically by a uniform mechanism or framework. For example, a framework such as Amoeba can route, and can only route, based on characteristics of the SQL (such as a table ID).

Conversely, the fewer associations are broken, the fewer join operations are restricted and the less the application has to compromise; but table routing becomes more complex and more tied to the business, so it is harder to handle with a uniform mechanism, and a separate route has to be implemented for each data request. The extreme in this direction is to keep all tables in one shard, that is, no vertical segmentation at all, so no association is broken. This is of course very extreme, and only feasible when the whole database is simple and has few tables.

In practice, the granularity should be chosen by weighing "business closeness" against "table data volume". In general:

If the tables are closely related, the amount of data is small, and growth is slow, it is appropriate to put them in one shard, with no horizontal segmentation;

If the tables hold a large amount of data and grow rapidly, horizontal segmentation must be applied on top of vertical segmentation. Horizontal segmentation means that the original single shard is subdivided into several smaller shards, each with one primary table (the table whose ID is hashed) and multiple secondary tables associated with it.

In short, the granularity of vertical segmentation is a trade-off between advantages and disadvantages that pull in opposite directions. The architect's job is to strike the balance that benefits the project most.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
