Basic Ideas and Partitioning Strategy of Database Sharding

Source: Internet
Author: User

This post focuses on the basic ideas of sharding and the theory behind the partitioning strategy. For more detailed implementation strategies and reference examples, please refer to my other blog post: "Database sub-list (sharding) series (a) split implementation strategy and sample demonstration".

First, the basic idea

The basic idea of sharding is to relieve the performance pressure on a single database by splitting it into multiple parts placed on different databases (servers). Loosely speaking, when a database holds a large amount of data because it has many tables, vertical partitioning is appropriate: closely related tables (for example, the tables of one module) are split out onto their own server. When there are few tables but each table holds a lot of data, horizontal partitioning is appropriate: the rows of a table are distributed across multiple databases (servers) by some rule, for example a hash of the ID. In practice the two situations are usually mixed, so you have to choose according to the actual circumstances. You may also combine vertical and horizontal partitioning, turning the original database into a matrix-like array of databases (servers) that can in principle be extended without limit. Vertical and horizontal partitioning (also called slicing or segmentation in this post) are described in more detail below.
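The two kinds of split can be sketched as a small routing function. This is only an illustration under stated assumptions: the table names, database names, and shard count below are hypothetical, and a plain modulo on an integer ID stands in for the "ID hash" rule mentioned above.

```python
from typing import Optional

# Hypothetical layout: these table and database names are illustrative
# assumptions, not taken from any real system.
VERTICAL_MAP = {
    "users": "user_db",          # user module
    "user_profiles": "user_db",
    "posts": "forum_db",         # forum module
    "replies": "forum_db",
}

# Databases that are additionally split horizontally, with their shard count.
HORIZONTAL_SHARDS = {"forum_db": 4}


def route(table: str, shard_key: Optional[int] = None) -> str:
    """Return the name of the database that should hold a row of `table`."""
    base = VERTICAL_MAP[table]               # vertical split: decided by table
    count = HORIZONTAL_SHARDS.get(base)
    if count is None:
        return base                          # not split horizontally
    if shard_key is None:
        raise ValueError(f"table {table!r} needs a shard key")
    return f"{base}_{shard_key % count}"     # horizontal split: ID modulo shard count


print(route("users"))        # user_db
print(route("posts", 10))    # forum_db_2
```

The vertical rule only needs the table name, while the horizontal rule also needs a value from the row, which is why horizontal partitioning has more impact on the application.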

The main characteristics of vertical partitioning are simple rules and a convenient implementation. It is especially suitable for systems in which the coupling between business modules is very low, their interaction is minimal, and the business logic is very clear. In such a system it is easy to split the tables used by different business modules into different databases. Because the split is done by table, the impact on the application is smaller and the split rule is simpler and clearer. (This is the so-called "share nothing" approach.)



Horizontal partitioning is somewhat more complex than vertical partitioning. Because rows of the same table have to be placed in different databases, the split rule is more complex for the application than a split by table name, and later data maintenance is also more complex.



Now consider data partitioning in the general case. On the one hand, the tables in a database are not usually all tied together by one single table, which implies that horizontal partitioning is almost always applied to a small group of closely related tables (in effect, one vertically split block) rather than to all tables. On the other hand, in some very high-load systems even a single table cannot be served by a single database host, which means that vertical partitioning alone does not completely solve the problem. As a result, most systems use the two together: they first partition the system vertically and then selectively partition each small group of tables horizontally. The entire database is thereby cut into a distributed matrix.

Second, the segmentation strategy

As mentioned earlier, partitioning is done vertically first and horizontally second; the result of vertical partitioning lays the groundwork for horizontal partitioning. The idea of vertical partitioning is to analyze the aggregation relationships between tables and keep closely related tables together. In most cases these are the tables of one module, or of one "aggregate" in the sense used by domain-driven design. Within each vertically split group of tables, find the "root element" (the "aggregate root" of domain-driven design) and partition horizontally by it: starting from the root element, put all data directly or indirectly associated with it into the same shard. Cross-shard associations then become very unlikely, and the application does not have to break existing associations between tables. For example, on a social networking site almost all data is ultimately linked to a user, so partitioning by user is the best choice. In a forum system, the user module and the forum module should be placed in two separate shards during vertical partitioning. For the forum module, the forum is clearly the aggregate root, so horizontal partitioning is done by forum, and it is natural to put all posts and replies of a forum in the same shard as the forum itself.
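A minimal sketch of the forum example, assuming every post and reply row carries the ID of its forum (the aggregate root) so that it can be routed without looking anything up. The class names, field names, and shard count are hypothetical.

```python
from dataclasses import dataclass

NUM_SHARDS = 4  # assumed number of horizontal shards for the forum module


@dataclass
class Post:
    post_id: int
    forum_id: int   # aggregate root ID, carried on every child row


@dataclass
class Reply:
    reply_id: int
    post_id: int
    forum_id: int   # copied from the post so replies route the same way


def shard_for(forum_id: int) -> int:
    """All data of one forum lands in the shard chosen by the forum ID."""
    return forum_id % NUM_SHARDS


post = Post(post_id=1001, forum_id=7)
reply = Reply(reply_id=5001, post_id=1001, forum_id=7)

# The post and its reply are co-located, so joining them stays inside one shard.
print(shard_for(post.forum_id), shard_for(reply.forum_id))   # 3 3
```

Routing by the post ID or reply ID instead would scatter the rows of one forum across shards and reintroduce cross-shard associations.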

For shared data: if it is a read-only dictionary table, maintaining a copy in each shard is a good choice, because the associations do not have to be broken. If it is an ordinary cross-node association between regular data, the association has to be broken.

In particular, the partitioning strategy changes subtly when vertical and horizontal partitioning are applied together. When only vertical partitioning is considered, the tables grouped together can keep arbitrary associations, so you can group tables by "function module". Once horizontal partitioning is introduced, the relationships between tables are greatly constrained: typically only one main table (the table whose ID is used for hashing) and its secondary tables can keep their associations. In other words, when vertical and horizontal partitioning are combined, the vertical split is no longer made by "function module" but at a finer granularity. That granularity coincides with, or is even identical to, the "aggregate" of domain-driven design, and the main table of each shard is the aggregate root of an aggregate. Partitioned this way, the database becomes rather fragmented (there are more shards, but each shard contains few tables). To avoid managing too many data sources and to make full use of each database server, consider placing two or more shards that are similar in business terms and have similar data growth rates (main table data volumes of the same order of magnitude) in the same data source. Each shard remains independent: it has its own main table and is hashed by its own main table ID. The only constraint is that their hash modulus (that is, the number of nodes) must be the same.
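One way to read the co-location rule is sketched below: two independent shard groups share the same set of data sources, and each is hashed by its own main table ID with the same modulus. The group names, ID columns, and data source names are hypothetical.

```python
from typing import Tuple

NODE_COUNT = 4  # shared hash modulus = number of nodes (assumed value)
DATA_SOURCES = [f"ds_{i}" for i in range(NODE_COUNT)]

# Hypothetical shard groups and the main table ID each one is hashed by.
SHARD_GROUPS = {"forum": "forum_id", "blog": "blog_id"}


def locate(group: str, main_id: int) -> Tuple[str, str]:
    """Return (data source, schema) for a row of the given shard group."""
    if group not in SHARD_GROUPS:
        raise KeyError(f"unknown shard group {group!r}")
    node = main_id % NODE_COUNT          # same modulus for every group
    return DATA_SOURCES[node], f"{group}_{node}"


print(locate("forum", 6))   # ('ds_2', 'forum_2')
print(locate("blog", 6))    # ('ds_2', 'blog_2')
```

Because both groups use the same modulus, the application manages one list of data sources instead of one list per group, while the two groups still never join against each other.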



1. Transaction issues:
There are two possible approaches to the transaction problem: distributed transactions, and transactions controlled jointly by the application and the database. A simple comparison of the two follows.
Scenario one: Using distributed transactions
Advantages: Managed by the database, simple and effective
Cons: High performance cost, especially as the number of shards grows
Scenario two: Joint control by the application and the database
Principle: Split a distributed transaction spanning multiple databases into multiple small transactions, each on a single database, and let the application control each small transaction.
Advantages: Better performance
Cons: Requires the application to control transactions flexibly. If you use Spring's transaction management, making this change will be somewhat difficult.
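Scenario two can be sketched with two in-memory SQLite databases standing in for two shards. This is a minimal illustration, not a complete solution: the `account` table and the `transfer` helper are hypothetical, and the application still has to handle the case where the first commit succeeds and the second one fails.

```python
import sqlite3


def open_shard() -> sqlite3.Connection:
    """Create an in-memory database that plays the role of one shard."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account ("
        "id INTEGER PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.commit()
    return conn


def balance(conn: sqlite3.Connection, account_id: int) -> int:
    row = conn.execute(
        "SELECT balance FROM account WHERE id = ?", (account_id,)
    ).fetchone()
    return row[0]


def transfer(src, src_id, dst, dst_id, amount) -> bool:
    """Run one small local transaction per shard, coordinated by the application."""
    try:
        src.execute(
            "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src_id)
        )
        dst.execute(
            "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst_id)
        )
    except sqlite3.Error:
        # Any failure before the commits: undo both local transactions.
        src.rollback()
        dst.rollback()
        return False
    # The two commits are NOT atomic as a pair; a crash between them needs
    # compensation logic, which this sketch leaves out.
    src.commit()
    dst.commit()
    return True


shard_a, shard_b = open_shard(), open_shard()
shard_a.execute("INSERT INTO account VALUES (1, 100)")
shard_b.execute("INSERT INTO account VALUES (2, 0)")
shard_a.commit()
shard_b.commit()

first_ok = transfer(shard_a, 1, shard_b, 2, 30)     # succeeds
second_ok = transfer(shard_a, 1, shard_b, 2, 500)   # violates CHECK, rolled back
print(first_ok, second_ok, balance(shard_a, 1), balance(shard_b, 2))
```

The performance advantage comes from avoiding a two-phase commit protocol across shards; the price is that failure handling moves into application code.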
2. Cross-node join issues
Cross-node joins are unavoidable once data is partitioned, but good design and partitioning can reduce how often they occur. The common solution is to run the query in two steps: collect the IDs of the associated data from the result set of the first query, then issue a second request to fetch the associated data by those IDs.
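The two-step query can be sketched as follows, again with in-memory SQLite databases standing in for two nodes. The `post` and `users` tables and their columns are hypothetical.

```python
import sqlite3

# Node 1: posts (hypothetical schema).
post_db = sqlite3.connect(":memory:")
post_db.execute(
    "CREATE TABLE post (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)"
)
post_db.executemany(
    "INSERT INTO post VALUES (?, ?, ?)",
    [(1, 10, "Hello"), (2, 11, "Sharding"), (3, 10, "Joins")],
)

# Node 2: users (hypothetical schema).
user_db = sqlite3.connect(":memory:")
user_db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
user_db.executemany("INSERT INTO users VALUES (?, ?)", [(10, "alice"), (11, "bob")])


def posts_with_authors():
    """Replace `post JOIN users` with two queries joined in the application."""
    # Step 1: query the first node and collect the IDs of the associated data.
    posts = post_db.execute(
        "SELECT id, author_id, title FROM post ORDER BY id"
    ).fetchall()
    author_ids = sorted({author_id for _, author_id, _ in posts})
    if not author_ids:
        return []

    # Step 2: fetch the associated rows from the other node by those IDs.
    placeholders = ",".join("?" * len(author_ids))
    rows = user_db.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders})", author_ids
    ).fetchall()
    names = dict(rows)

    # Join in memory.
    return [(title, names.get(author_id)) for _, author_id, title in posts]


print(posts_with_authors())
# [('Hello', 'alice'), ('Sharding', 'bob'), ('Joins', 'alice')]
```

Deduplicating the IDs before the second request keeps the `IN` list short when many rows reference the same associated record.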

3. Cross-node count, order by, group by, and aggregate function issues
These are all the same kind of problem, because they all need to be computed over the whole data set. Most proxies do not merge the results automatically. Solution: as with cross-node joins, fetch the results from each node separately and merge them on the application side. Unlike a join, the queries on the individual nodes can run in parallel, so this is often much faster than querying a single large table. However, if the result set is large, the application's memory consumption becomes a problem.
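Application-side merging can be sketched like this. Plain Python lists stand in for the rows stored on three nodes, and `query_shard` is a hypothetical stand-in for the per-node SQL (a count, a group by with sum, and an order by).

```python
import heapq
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# Hypothetical data: (user, amount) rows stored on three nodes.
SHARDS = [
    [("alice", 30), ("bob", 10)],
    [("carol", 25), ("alice", 5)],
    [("bob", 40)],
]


def query_shard(rows):
    """Stand-in for the queries each node would run locally."""
    count = len(rows)                                          # COUNT(*)
    totals = Counter()
    for user, amount in rows:                                  # GROUP BY user, SUM(amount)
        totals[user] += amount
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)   # ORDER BY amount DESC
    return count, totals, ordered


# The per-node queries are independent, so they can run in parallel.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(query_shard, SHARDS))

# Merge step 1: a global count is the sum of the per-node counts.
total_count = sum(count for count, _, _ in partials)

# Merge step 2: group-by totals are added key by key.
grouped = Counter()
for _, totals, _ in partials:
    grouped.update(totals)

# Merge step 3: already-sorted per-node lists are merged lazily, then limited.
sorted_lists = [ordered for _, _, ordered in partials]
top3 = list(islice(heapq.merge(*sorted_lists, key=lambda r: r[1], reverse=True), 3))

print(total_count)     # 5
print(dict(grouped))   # {'alice': 35, 'bob': 50, 'carol': 25}
print(top3)            # [('bob', 40), ('alice', 30), ('carol', 25)]
```

Merging sorted lists lazily and stopping at the limit avoids holding the full combined result in memory, which addresses part of the memory concern for large result sets. Averages need extra care: merge the per-node sums and counts and divide at the end, rather than averaging the per-node averages.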

Resources:

"MySQL Performance Tuning and Architecture Design"

Note: The picture is from the book "MySQL Performance Tuning and Architecture Design".
