Basic idea and segmentation strategy of database sharding


At present, most applications adopt one of two kinds of sharding rules (for splitting data across databases or tables):

    1. Mod mode (e.g., id % n)
    2. Date mode such as dayOfWeek (all Monday data in one library/table, or all data of a given month in one library/table)

The two approaches share an essential characteristic: dispersion plus periodicity.
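For concreteness, here is a minimal sketch of the two discrete mappings in Java (the method names are illustrative, not from any particular framework):

    // Mod mode: the bucket is the remainder of the key divided by the shard count.
    static int modBucket(long id, int shards) {
        return (int) (id % shards);
    }

    // Date mode: the bucket is a periodic date field, e.g. the day of the week (1..7).
    static int dayOfWeekBucket(java.time.LocalDate date) {
        return date.getDayOfWeek().getValue();
    }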

For example, splitting a table into libraries or tables by taking the primary key modulo 3:

As the total data volume increases, so does the volume of each table or library. When a table or library reaches its limit, a library or table must be added, and the dispersive nature of this kind of sharding algorithm makes data migration necessary. For example, when expanding from 3 to 5:

The data originally partitioned by mod 3 must be re-partitioned by mod 5, which inevitably means migration: the rows of each table are redistributed across multiple new tables.
Similarly, expanding from 7 libraries/tables keyed by dayOfWeek to 31 libraries/tables keyed by dayOfMonth also requires data migration.
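A quick check makes the reshuffling visible (a hypothetical illustration): comparing buckets under mod 3 and mod 5, only ids whose two remainders happen to coincide stay in place.

    public static void main(String[] args) {
        // For ids 0..14, print the old bucket (id % 3) next to the new bucket (id % 5).
        for (long id = 0; id < 15; id++) {
            System.out.printf("id=%2d  mod3=%d  mod5=%d%n", id, id % 3, id % 5);
        }
        // Only ids 0, 1 and 2 in this range keep the same remainder; the other 80% must move.
    }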

The problems with data migration are:

    1. The business must be released at least twice.
    2. Dedicated tools must be written to move the data; because businesses differ, a unified tool is hard to build, and currently almost every business writes its own.
    3. Problems such as incremental versus full migration, point-in-time cutover, and data inconsistency must be solved.

Can data migration be avoided entirely when the data volume reaches the limit of the existing libraries/tables and new ones must be added?
Data usually grows over time: as the business develops, the data volume increases steadily. (Growth that is not time-driven, for example a one-off bulk import from another system, can be handled entirely by DBAs under the existing sharding rules, so it is not considered here.)

Given this growth pattern, if we split libraries by incremental ranges of a field that grows with time, data migration can be avoided.
For example, if the id is a global sequence that grows over time, libraries can be split by id range (a global sequence can currently be implemented with TDDL or ZooKeeper):
ids in [0, 100M) go to the first library, [100M, 200M) to the second, [200M, 300M) to the third (M stands for million).

Alternatively, take a time field, for example gmt_create_time, the record-creation timestamp, and split by ranges of that field.
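A minimal sketch of such a continuous mapping, assuming the 100M-wide id ranges above (the method name is hypothetical):

    // Route by id range: [0, 100M) -> DB0, [100M, 200M) -> DB1, [200M, 300M) -> DB2, ...
    static String dbRuleByRange(long id) {
        final long M = 1_000_000L;
        if (id < 0) throw new IllegalArgumentException("negative id: " + id);
        return "DB" + (id / (100 * M));   // each library owns one contiguous 100M-wide range
    }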

This way, when the data volume reaches the upper limit of the first few libraries/tables, new libraries/tables are simply appended and the existing data never needs to migrate.
But this approach creates a hot spot: when the current data volume falls within the range of one library/table, all inserts concentrate on that library/table.

Therefore, while satisfying the basic business requirements, a sharding scheme should avoid two problems as far as possible:

1. Data migration
2. Hot Spots

How can we avoid data migration and also avoid insert/update hot spots?
Combining the strengths of discrete sharding and continuous sharding, if we want write hot spots and new data spread evenly across libraries while keeping horizontal expansion easy, consider the following pattern:

"Horizontal Scaling Scale-out Scenario mode one"

Phase one: a single library DB0 containing 4 tables, split by id % 4:

Phase two: add library DB1 and relocate the entire tables t2 and t3 to DB1:

Phase three: add libraries DB2 and DB3; relocate the entire table t1 to DB2 and the entire table t3 to DB3:

To keep the rules regular, we swap the names and positions of DB1 and DB2 through internal name mapping or similar means:

dbRule: "DB" + (id % 4)
tbRule: "t" + (id % 4)

In this 3-phase expansion scheme, each expansion needs only one downtime release and no row-level data migration; the downtime release involves only whole-table relocation.
For both developers and DBAs, this is far simpler than redistributing the rows of every table.

Going further, if the database is designed and deployed with one table per hard disk, then expansion is just a matter of pulling a disk out of the old machine and
plugging it into the new one: the whole-table move is done, and downtime shrinks dramatically.

In MySQL, libraries (schemas) can play the role of tables: start 4 DB instances on one physical machine, double the machines at each expansion, and relocate whole libraries to the new machines.
The rules then never need to change at all; they are always:

dbRule: "DB" + (id % 4)
tbRule: "t" + (id % 4)

That is, logically there are always 4 libraries and 4 tables, one table per library. This is also the approach currently adopted by the shop-line image space.
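One way to picture this variant (a sketch under assumptions; the host map below is hypothetical, not an actual TDDL configuration) is that the logical rule "DB" + (id % 4) is fixed forever, and each expansion only re-points logical libraries at physical machines:

    // The logical rule never changes; expansion only rewrites this logical -> physical map.
    static final java.util.Map<String, String> PHASE_1 = java.util.Map.of(
            "DB0", "host0", "DB1", "host0", "DB2", "host0", "DB3", "host0"); // 4 instances, 1 machine
    static final java.util.Map<String, String> PHASE_2 = java.util.Map.of(
            "DB0", "host0", "DB1", "host0", "DB2", "host1", "DB3", "host1"); // machines doubled
    static final java.util.Map<String, String> PHASE_3 = java.util.Map.of(
            "DB0", "host0", "DB1", "host1", "DB2", "host2", "DB3", "host3"); // doubled again

    static String physicalHost(long id, java.util.Map<String, String> phase) {
        return phase.get("DB" + (id % 4));  // dbRule stays "DB" + (id % 4) throughout
    }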

One drawback of the above scheme is that the volume of each single table keeps growing throughout the journey from 1 library to 4 libraries. Once a single table exceeds a certain size, performance problems appear, such as indexing issues and historical-data cleanup issues.
Moreover, once the reserved tables are used up and we reach the stage of 4 physical libraries with 1 table each, further expansion inevitably has to start from the tables again. So consider a scheme in which the data volume per table has a fixed upper bound:

"Horizontal Scaling Scale-out Scenario mode two"

Phase one: one database, two tables, split by id % 2:

dbRule: "DB0"
tbRule: "t" + (id % 2)

Phase two: when the library's data volume approaches 10 million rows and each table approaches 5 million, expand (the figures are only examples; the actual trigger depends on the database and the real load):
add database DB1 and migrate the whole table db0.t1 to it.
Add one table to each library, so that the next 10M-20M of data, split by mod 2, is written to the two new tables t0_1 and t1_1:

dbRule:

    "DB" + (id % 2)

tbRule:

    if (id < 10000000) {
        return "t" + (id % 2);        // data before 10M stays in t0 and t1; t1 has moved from DB0 to DB1
    } else if (id < 20000000) {
        return "t" + (id % 2) + "_1"; // data after 10M goes into the two new tables t0_1 and t1_1, one per library
    } else {
        throw new IllegalArgumentException("id out of range [20000000]: " + id);
    }

In this way the new 10M of data is spread evenly across DB0 and DB1, so insert/update and query hot spots remain evenly distributed across the libraries.
Each library holds both old data and growing new data, and each table stays under the 5-million cap.
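Putting the phase-two dbRule and tbRule together, a self-contained sketch (the method name is assumed) that resolves an id to its full library.table coordinate:

    // Phase-two routing: e.g. id 3 -> "DB1.t1", id 12000000 -> "DB0.t0_1".
    static String routePhaseTwo(long id) {
        final long TEN_M = 10_000_000L;
        if (id < TEN_M) {
            return "DB" + (id % 2) + ".t" + (id % 2);         // old data; t1 now lives in DB1
        } else if (id < 2 * TEN_M) {
            return "DB" + (id % 2) + ".t" + (id % 2) + "_1";  // new data in t0_1 / t1_1
        } else {
            throw new IllegalArgumentException("id out of range [20000000]: " + id);
        }
    }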

Phase three: when the two libraries approach their capacity limit, continue scaling horizontally as follows:
add two new libraries DB2 and DB3 and split by id % 4, with remainders 0, 1, 2, 3 mapping to the DB subscripts. t0 and t1 stay where they are;
migrate the whole table db0.t0_1 to DB2 and the whole table db1.t1_1 to DB3.
The 20M-40M data, split by mod 4, goes into 4 new tables t0_2, t1_2, t2_2, t3_2, placed in the 4 libraries:

The new sharding rules are as follows:

dbRule:

    if (id < 20000000) {
        // data before 20M: 4 tables placed in 4 libraries
        if (id < 10000000) {
            return "DB" + (id % 2);        // the original t0 is still in DB0, t1 still in DB1
        } else {
            return "DB" + ((id % 2) + 2);  // t0_1 has moved from DB0 to DB2; t1_1 from DB1 to DB3
        }
    } else if (id < 40000000) {
        return "DB" + (id % 4);            // data beyond 20M is spread evenly across the 4 libraries
    } else {
        throw new IllegalArgumentException("id out of range. id: " + id);
    }

tbRule:

    if (id < 20000000) {
        // data before 20M: table rules identical to phase two
        if (id < 10000000) {
            return "t" + (id % 2);         // data before 10M stays in t0 and t1
        } else {
            return "t" + (id % 2) + "_1";  // data after 10M stays in t0_1 and t1_1
        }
    } else if (id < 40000000) {
        return "t" + (id % 4) + "_2";      // data beyond 20M splits into the 4 tables t0_2, t1_2, t2_2, t3_2
    } else {
        throw new IllegalArgumentException("id out of range. id: " + id);
    }

Over time, as t0/t1 from phase one and t0_1/t1_1 from phase two gradually become unused historical data, the whole tables can simply be truncated, eliminating the hassle of historical-data migration.

The sharding rules of all three phases above are fully supported in TDDL 2.x; consult the TDDL team for details.

"Horizontal scaling Scale-out scenario model three"

Non-multiple expansion: what if, going from phase two to phase three above, you don't want to add two libraries at once? Try the following:

Before migration:

Add a new library DB2; t0 and t1 remain in DB0.

Migrate the whole table t0_1 to DB1
Migrate the whole table t1_1 to DB2

After migration:

At this point DB0 degenerates into a library serving reads and updates of old data, while the hot spots of new data are spread evenly across DB1 and DB2.
Since 4 is not divisible by 3, scaling from 2 libraries of 4 tables to 3 libraries without row-level migration while keeping hot spots evenly distributed seems impossible.

Of course, if you do not insist on only two tables per library, the following is also possible:

Data below 10M stays in t0 and t1 in DB0, split by mod 2, with the existing rows unchanged.
Data in 10M-20M stays split by mod 2 across the two tables t0_1 and t1_1, rows unchanged, with the tables relocated to DB1 and DB2 respectively.
Data beyond 20M is split by mod 3 into three new tables distributed evenly across the 3 libraries.
DB0 then holds the two oldest tables plus one third of the newest data; DB1 and DB2 each hold one of the newer old tables (t0_1 or t1_1) plus one third of the newest data.
Both old and new data are evenly distributed.
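The layered rules for this three-library variant might look as follows (a sketch; the newest tables' names are assumed to follow the _2 suffix pattern, which the original text garbles):

    // Scheme three routing across DB0, DB1, DB2 without row-level migration.
    static String routeSchemeThree(long id) {
        final long TEN_M = 10_000_000L;
        if (id < TEN_M) {
            return "DB0.t" + (id % 2);                             // oldest data stays in DB0
        } else if (id < 2 * TEN_M) {
            return "DB" + ((id % 2) + 1) + ".t" + (id % 2) + "_1"; // t0_1 -> DB1, t1_1 -> DB2
        } else {
            return "DB" + (id % 3) + ".t" + (id % 3) + "_2";       // newest data mod 3 over all 3 libraries
        }
    }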

In a word:
There are two kinds of rule mappings (functions):

    1. Discrete mapping, such as mod or dayOfWeek: this type of mapping spreads hot spots well, but brings data-migration and historical-data problems.
    2. Continuous mapping, such as contiguous ranges of id or gmt_create_time: this type avoids data migration, but brings hot-spot problems.

Discrete mapping and continuous mapping complement each other: their hot-spot and migration problems are exactly opposite.
Where previously only discrete mapping was used, introducing continuous mapping rules and combining the two with careful design
should make it possible to build rules that achieve any desired trade-off between avoiding hot spots and reducing migration.

Based on the above considerations, the design and configuration of sharding rules should meet the following long-term requirements:

      1. Rule changes can be pushed dynamically.
      2. Rules can be stacked in layers: old rules continue to apply beneath new rules, and each new rule extends coverage over a wider range, keeping old and new rules compatible and avoiding data migration (see the sketch after this list).
      3. When using mod mode, it is best to choose a power of 2 for the number of libraries/tables, to make later splitting easier.
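Requirement 2 can be modeled as an ordered list of (upper bound, rule) layers that is only ever appended to, never rewritten, which is exactly why old data never has to move. A hypothetical sketch, not TDDL's actual API:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.LongFunction;

    // Each layer owns ids below its bound; expansion appends a layer, old layers stay intact.
    class StackedRouter {
        private record Layer(long upperBound, LongFunction<String> rule) {}
        private final List<Layer> layers = new ArrayList<>();

        void push(long upperBound, LongFunction<String> rule) {
            layers.add(new Layer(upperBound, rule));   // each new rule extends the covered range
        }

        String route(long id) {
            for (Layer l : layers)
                if (id < l.upperBound()) return l.rule().apply(id);
            throw new IllegalArgumentException("id out of range: " + id);
        }
    }

For example, phase two of scheme two becomes two pushed layers:

    StackedRouter r = new StackedRouter();
    r.push(10_000_000L, id -> "DB" + (id % 2) + ".t" + (id % 2));
    r.push(20_000_000L, id -> "DB" + (id % 2) + ".t" + (id % 2) + "_1");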


