Basic Idea and Partitioning Strategy of Database Sharding

Last Update:2017-07-21 Source: Internet

Author: User


Reposted from: http://blog.csdn.net/bluishglc/article/details/6161475

This article introduces the basic idea of sharding and the theory behind the partitioning strategy. For a more detailed implementation strategy and a reference example, see my blog post: Database Sharding Series (I): Split Implementation Strategy and Demo.

First, the basic idea

The basic idea of sharding is to relieve the performance pressure on a single database by splitting one database into several parts and placing them on different databases (servers).

Loosely speaking, for a database holding a very large amount of data: if the volume comes from having many tables, vertical partitioning is appropriate, that is, closely related tables (for example, those of the same module) are split out and placed on one server. If there are not many tables but each table holds a lot of data, horizontal partitioning is appropriate, that is, the rows of a table are distributed across multiple databases (servers) according to some rule (for example, a hash of the ID).
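As a rough illustration of "some rule (for example, by ID hash)", the sketch below routes a row to a shard by taking the ID modulo the number of shards. The shard count and the function names are assumptions made for this example only; they do not come from the original article.

```python
import zlib

SHARD_COUNT = 4  # hypothetical number of database servers


def shard_for(record_id: int, shard_count: int = SHARD_COUNT) -> int:
    """Return the index of the shard that owns the row with this numeric ID."""
    return record_id % shard_count


def shard_for_key(key: str, shard_count: int = SHARD_COUNT) -> int:
    """Route a string key. crc32 is stable across processes, unlike built-in hash()."""
    return zlib.crc32(key.encode("utf-8")) % shard_count


if __name__ == "__main__":
    for user_id in (1, 2, 5, 8):
        print(user_id, "-> shard", shard_for(user_id))
```

A plain modulo rule is easy to reason about, but changing the shard count later moves most rows, which is one reason the rule has to be chosen carefully up front.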

Of course, in practice the two situations are usually mixed together, and a choice has to be made based on the actual circumstances. Vertical and horizontal partitioning may also be combined, so that the original database is split into a matrix-like array of databases (servers) that can keep being extended. Vertical and horizontal partitioning are each described in more detail below.

The main advantage of vertical partitioning is that the rules are simple and it is easy to carry out. It is especially suitable for systems in which the coupling between business areas is very low, they affect each other very little, and the business logic is very clear. In such a system it is easy to split the tables used by different business modules into different databases. Because the split is made along table boundaries, the impact on the application is smaller and the split rules are simpler and clearer. (This is the so-called "share nothing" approach.)

Horizontal partitioning is somewhat more complex than vertical partitioning. Because different rows of the same table must be split across different databases, the split rule itself is more complex for the application than splitting by table name, and later data maintenance is also more complex.

Consider data partitioning in the general case. On the one hand, the tables of a database usually cannot all be linked together through one single table. This implies that horizontal partitioning is almost always applied to small clusters of closely related tables (in effect, the blocks produced by vertical partitioning) and cannot be applied to all tables at once. On the other hand, in some very high-load systems even a single table cannot be carried by a single database host, which means vertical partitioning alone does not fully solve the problem. As a result, most systems combine the two: they first partition the system vertically, then selectively partition each small cluster of tables horizontally.

The entire database is thus split into a distributed matrix.

Second, the partitioning strategy

As mentioned earlier, partitioning is done vertically first and then horizontally, and the result of vertical partitioning lays the groundwork for horizontal partitioning. The idea of vertical partitioning is to analyze the aggregation relationships between tables and put closely related tables together.

In most cases these tables belong to the same module, or to the same "aggregate". The "aggregate" here is exactly the aggregate of domain-driven design.

Within each group of tables produced by vertical partitioning, find the "root element" (the "aggregate root" of domain-driven design) and partition horizontally by it. That is, starting from the root element, put all the data directly or indirectly associated with it into the same shard.

This makes cross-shard associations very unlikely, so the application does not have to break existing associations between tables. For example, on a social networking site almost all data is ultimately associated with some user, so partitioning by user is the best choice. As another example, in a forum system the user module and the forum module should already have been placed in two different shards by vertical partitioning. For the forum module, Forum is clearly the aggregate root, so it is natural to partition horizontally by forum and put all the posts and replies of a forum in the same shard as the forum itself.
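The forum example can be sketched as follows. The idea is that posts and replies are routed by the ID of their aggregate root (the forum), never by their own IDs, so everything under one forum lands in one shard. The class names, the field names, and the shard count are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

FORUM_SHARDS = 3  # hypothetical number of shards for the forum module


@dataclass
class Post:
    post_id: int
    forum_id: int  # reference to the aggregate root


@dataclass
class Reply:
    reply_id: int
    post_id: int
    forum_id: int  # root ID carried on the reply so it can be routed directly


def shard_of_forum(forum_id: int) -> int:
    """The aggregate root decides the shard."""
    return forum_id % FORUM_SHARDS


def shard_of(entity) -> int:
    """Posts and replies follow their forum; their own IDs are not hashed."""
    return shard_of_forum(entity.forum_id)
```

Carrying `forum_id` on every reply is a design assumption of this sketch: it avoids looking up the post first just to find out which shard a reply belongs to.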

For shared data, if it is a read-only dictionary table, keeping a copy in every shard is a good choice, because the associations then do not have to be broken. Cross-node associations between ordinary data, however, must be broken.
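A minimal sketch of keeping a copy of a read-only dictionary table in every shard is shown below, using in-memory SQLite databases as stand-ins for real shards. The table names, the sample rows, and the `id % 2` split are assumptions for this example.

```python
import sqlite3

# Hypothetical read-only dictionary data shared by every shard.
COUNTRIES = [(1, "China"), (2, "Japan")]


def make_shard(users):
    """Create one in-memory shard holding its own users plus a full dictionary copy."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT)")
    db.executemany("INSERT INTO country VALUES (?, ?)", COUNTRIES)
    db.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, country_id INTEGER)"
    )
    db.executemany("INSERT INTO users VALUES (?, ?, ?)", users)
    db.commit()
    return db


def users_with_country(db):
    """A plain local join; no cross-node lookup is needed for the dictionary."""
    return db.execute(
        "SELECT u.name, c.name FROM users u JOIN country c ON c.id = u.country_id "
        "ORDER BY u.id"
    ).fetchall()


# Users are split by id % 2; the dictionary rows are identical on both shards.
shards = [make_shard([(2, "bob", 2)]), make_shard([(1, "alice", 1)])]
```

Because the dictionary is read-only, the copies cannot drift apart; a table that is updated would need some synchronization mechanism that this sketch does not cover.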

One point needs special mention: when vertical and horizontal partitioning are performed together, the partitioning strategy changes subtly. With vertical partitioning alone, the tables grouped together may keep arbitrary associations with one another, so tables can be divided by "functional module". Once horizontal partitioning is introduced, however, associations between tables are heavily constrained: usually only one main table (the table whose ID is hashed) and its secondary tables can keep their associations. In other words, when both are applied, the vertical split is no longer made by "functional module" but at a finer granularity, and this granularity coincides with the concept of the "aggregate" in domain-driven design. One could even say the two are identical: the main table of each shard is exactly the aggregate root of an aggregate!

Partitioned this way, you will find that the database has become too fragmented (there are many shards, but few tables in each). To avoid managing too many data sources and to make full use of each database server, consider placing two or more shards that are close in business terms and have similar data growth rates (main-table data volumes of the same order of magnitude) in the same data source. Each shard remains independent: it has its own main table and is hashed by its own main-table ID. The only requirement is that their hash moduli (that is, the number of nodes) must be the same.
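The co-location of several shards in one set of data sources, as described above, could look roughly like this. Two independent shard groups, each hashed by its own main-table ID, share the same data sources because they use the same modulus. The data source names and table names are hypothetical.

```python
NODE_COUNT = 4  # shared hash modulus: every co-located shard group must use it

# Hypothetical data source names, one per node.
DATA_SOURCES = [f"ds_{i}" for i in range(NODE_COUNT)]


def locate(main_table: str, main_id: int) -> tuple:
    """Return (data source, main table) for a row of the given shard group.

    Each shard group is identified by its main table and hashed by that
    table's own ID, but all groups map onto the same NODE_COUNT data sources.
    """
    return DATA_SOURCES[main_id % NODE_COUNT], main_table


if __name__ == "__main__":
    print(locate("orders", 10))
    print(locate("invoices", 7))
```

If the two groups used different moduli, they would need different numbers of nodes and could no longer share one set of data sources, which is why the modulus has to match.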

1. Transaction issues
There are currently two feasible ways to handle transactions: distributed transactions, and transactions controlled jointly by the application and the databases. A brief comparison of the two follows.
Scenario one: use distributed transactions
    Pros: managed by the database; simple and effective
    Cons: high performance cost, especially as the number of shards grows
Scenario two: joint control by the application and the databases
    Principle: split a distributed transaction that spans multiple databases into several
               small transactions, each confined to a single database, and let the
               application coordinate the small transactions.
    Pros: better performance
    Cons: the application needs a flexible transaction-control design. If Spring's
          transaction management is used, making this change can be difficult.
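Scenario two can be illustrated with the sketch below, which uses two in-memory SQLite databases as stand-ins for two shards. The application runs one small local transaction per database and decides whether to commit or roll back both. The `account` table and the function names are assumptions for this example, and the approach shown is best effort only: it is not a real two-phase commit.

```python
import sqlite3


def make_db(rows):
    """Create an in-memory database standing in for one shard."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
    db.executemany("INSERT INTO account VALUES (?, ?)", rows)
    db.commit()
    return db


def balance(db, account_id):
    row = db.execute("SELECT balance FROM account WHERE id = ?", (account_id,))
    return row.fetchone()[0]


def transfer(src_db, dst_db, src_id, dst_id, amount):
    """Run one small local transaction per database, coordinated by the application."""
    try:
        cur = src_db.execute(
            "UPDATE account SET balance = balance - ? WHERE id = ? AND balance >= ?",
            (amount, src_id, amount),
        )
        if cur.rowcount != 1:
            raise ValueError("insufficient funds or unknown source account")
        cur = dst_db.execute(
            "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst_id)
        )
        if cur.rowcount != 1:
            raise ValueError("unknown destination account")
    except Exception:
        # Any failure before the commits undoes both local transactions.
        src_db.rollback()
        dst_db.rollback()
        raise
    # Not atomic: a crash between these two commits leaves the shards
    # inconsistent, so a real system needs compensation or reconciliation.
    src_db.commit()
    dst_db.commit()
```

The gap between the two commits is the price paid for avoiding a distributed transaction; how to close or tolerate that gap is exactly the "flexible design" the application has to provide.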
2. Cross-node join issues
As soon as data is partitioned, cross-node joins become unavoidable, but good design and partitioning can reduce how often they occur.

The common solution is to run two queries: find the IDs of the associated data in the result set of the first query, then issue a second request based on those IDs to fetch the associated data.
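The two-query approach might look like the following sketch, again with in-memory SQLite databases standing in for two nodes. The `orders` and `users` tables, their columns, and the sample rows are hypothetical.

```python
import sqlite3

order_db = sqlite3.connect(":memory:")  # hypothetical node holding orders
user_db = sqlite3.connect(":memory:")   # hypothetical node holding users

order_db.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total INTEGER)"
)
order_db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(1, 10, 250), (2, 11, 90), (3, 10, 40)]
)
user_db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
user_db.executemany("INSERT INTO users VALUES (?, ?)", [(10, "alice"), (11, "bob")])


def orders_with_user_names(min_total):
    # Query 1: fetch the driving rows and collect the IDs of the associated data.
    orders = order_db.execute(
        "SELECT id, user_id, total FROM orders WHERE total >= ? ORDER BY id",
        (min_total,),
    ).fetchall()
    user_ids = sorted({user_id for _, user_id, _ in orders})
    if not user_ids:
        return []
    # Query 2: fetch the associated rows by ID from the other node.
    placeholders = ",".join("?" * len(user_ids))
    names = dict(
        user_db.execute(
            f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids
        ).fetchall()
    )
    # Join the two result sets in application memory.
    return [(order_id, names.get(uid), total) for order_id, uid, total in orders]
```

Deduplicating the IDs before the second query keeps the `IN` list short when many rows point at the same associated record.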

3. Cross-node count, order by, group by, and aggregate function issues
These form one class of problem, because they all have to be computed over the entire data set. Most proxies do not merge the results automatically. The solution is similar to that for cross-node joins: obtain the results on each node separately, then merge them in the application.

Unlike a join, the queries on the individual nodes can run in parallel, so this is often much faster than querying a single large table.

However, if the result set is very large, the application's memory consumption becomes a problem.
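A sketch of merging per-node results in the application is given below. Plain Python lists stand in for the tables on three nodes, the per-node functions imitate what each node would return for a group by/count query and an order by/limit query, and a thread pool runs them in parallel. All names and the sample data are assumptions for illustration.

```python
import heapq
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# Hypothetical shards: each list stands in for one node's rows (post_id, forum, score).
SHARDS = [
    [(1, "python", 50), (4, "go", 20)],
    [(2, "python", 70), (5, "rust", 10)],
    [(3, "go", 40)],
]


def local_count_by_forum(rows):
    """What one node would compute for SELECT forum, COUNT(*) ... GROUP BY forum."""
    return Counter(forum for _, forum, _ in rows)


def local_top_by_score(rows, limit):
    """What one node would return for ORDER BY score DESC LIMIT limit."""
    return sorted(rows, key=lambda r: r[2], reverse=True)[:limit]


def count_by_forum():
    # Run the per-node queries in parallel, then add the partial counts.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(local_count_by_forum, SHARDS))
    return dict(sum(partials, Counter()))


def top_by_score(limit):
    # Each node returns its own sorted top rows; merge them and cut to the limit.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda rows: local_top_by_score(rows, limit), SHARDS))
    merged = heapq.merge(*partials, key=lambda r: r[2], reverse=True)
    return list(islice(merged, limit))
```

Counts and sums can simply be added across nodes, and sorted lists can be merged lazily, which limits memory use. An average, by contrast, has to be rebuilt from per-node sums and counts rather than from per-node averages.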

References:

"MySQL Performance Tuning and Architecture Design"

Note: This picture is from the "MySQL Performance Tuning and architecture Design" book

Related reading:

Database Sharding Series (V): A sharding scale-out scheme that allows free planning without data migration or changes to routing code
Database Sharding Series (IV): Transactions across multiple data sources
Database Sharding Series (III): Using a framework versus in-house development, and considerations on the implementation layer of sharding
Database Sharding Series (II): Global primary key generation strategy
Database Sharding Series (I): Split implementation strategy and demo
On the granularity of vertical sharding
Basic idea and partitioning strategy of database sharding
