Database sub-table __ Database

Source: Internet
Author: User
Tags stub svn dedicated server


first, the basic idea

The basic idea of sharding is to divide a database into several parts and put it on a different database (server) to alleviate the performance problem of a single database. For a database of massive data, if it is because of more tables and more data, it is appropriate to use vertical segmentation, that is, the relationship between the close (such as the same module) of the table to split out on a server. If you don't have a lot of tables, but there's a lot of data for each table, then it's a good time to split the table's data into multiple databases (server) with some sort of rule, such as hash by ID. Depending on the actual situation to make a choice, may also be integrated use of vertical and horizontal segmentation. 1, Vertical segmentation

Vertical segmentation of data can also be called vertical segmentation. Think of the database as a large chunk of a large chunk of "data block" (table), we vertically cut the pieces of data, and then spread them over multiple database hosts, such a segmentation method is a vertical (vertical) data segmentation.

System functions can be basically divided into the following four functional modules: users, group messages, photo albums, and events, respectively, corresponding to the following tables:

1. Users module table user, User_profile, User_group, User_photo_album

2. Group discussion table groups, Group_message, Group_message_content, top_message

3. Album related table photo, Photo_album, Photo_album_relation, photo_comment

4. Events Information Table Event

The relationship between modules:

1. There is an association between a group discussion module and a user module, either through a user or a group relationship. A general association is associated with the user ID or nick_name and the ID of the group, and the implementation of the interface between modules does not cause much trouble.

2. Photo Album module only with user module exists through the user's association. The correlation between these two modules is basically the content that is associated through the user ID, which is simple and clear and has a clear interface.

3. Event modules may be associated with each module, but they are only concerned with the information IDs of the objects in each module, and can be easily split.

So, the first step can be the database according to the function of the table related to a vertical split, each module involved in the table to a single database, module and module of the table association between the application system through the interface to deal with. As shown in the following illustration:

After such vertical segmentation, the services that can only be provided through a single database are split into four databases to provide services, and the service capability is naturally increased by several times.

The advantages of vertical segmentation

The U-database split is simple and clear, and the split rules clearly

U application module is clear and easy to integrate

U data maintenance is easy and easy to locate.

The disadvantage of vertical segmentation

The U-Part Table Association cannot be completed at the database level and needs to be completed in the program

U for the extremely frequent access and the data is very large table still has a quiet performance, does not necessarily meet the requirements

U transaction processing is relatively more complex

When you reach a certain level of segmentation, extensibility will encounter limitations

U too much segmentation may bring the system transition complex and difficult to maintain 2, horizontal segmentation

The horizontal segmentation of the data, in general, the simple horizontal segmentation is mainly a very ordinary access to a table in a certain field of some of the rules to spread into multiple tables, each table contains a portion of data. In short, we can interpret the horizontal segmentation of the data as the segmentation of the data rows, that is, to slice some rows of the table into one database, and some other lines to the other.

For our sample database, most of the tables can be split horizontally according to the user ID, and the different user-related data are split and stored in different databases. If you take all user IDs through 2 modulo, and then store them in two different databases, each of the tables associated with the user ID can be split like this. In this way, basically each user-related data is in the same database, even if it needs to be associated, it can be very easy to relate.

We can use the following image to show more intuitive level segmentation related information:


The advantages of horizontal segmentation

U Table Association can be done on the database side basically

You do not have problems with some very large data and high load tables that are experiencing bottlenecks

U application-side overall schema changes are relatively small

U transaction processing is relatively simple

U as long as the segmentation rules can be defined well, it is basically difficult to encounter extensibility constraints

The disadvantage of horizontal segmentation

The U-splitting rule is more complex and it is difficult to abstract a segmentation rule that satisfies the whole database.

U-POST data maintenance difficulty increased, manual location of data is more difficult

The coupling degree of each module in U application system is high, which may cause some difficulties to the migration and split of the data in the following 3, vertical and horizontal joint segmentation

In general, all the tables in our database are difficult to relate to by a single (or a few) fields, so it is difficult to solve all the problems simply by slicing the data horizontally. and vertical segmentation can only solve some of the problems, for those systems with very high load, even if only a single table can not bear its load through a single database host, we must combine the "horizontal" and "vertical" two kinds of segmentation methods to use simultaneously, make full use of the advantages of both to avoid its shortcomings.

The load on each application system is growing step-by-step, and at the start of a performance bottleneck, most architects and DBAs will choose to split the data vertically, but as the business expands and the system loads continuously, after a period of system stability, After the vertical split of the database cluster may again overwhelmed, encountered a performance bottleneck.

At this point we have to use the edge of the data segmentation to solve the problems encountered here. For our sample database, assuming that at the beginning, we have a vertical segmentation of the data, but as the business continues to grow, the database system encountered bottlenecks, we chose to refactor the database cluster architecture. How to Refactor. Considering the vertical segmentation of the data, the structure of the module is clear and clear. And the growing momentum of business growth, even now further split module, it will not persist for too long. We have chosen to split horizontally on the basis of vertical segmentation.

Each of the database clusters that has undergone a vertical split has only one functional module, and all of the tables that each feature die on are basically associated with a field. If all user modules can be split by user ID, the group discussion module is divided by the group ID, the photo album module is based on the album ID, the final Event notification table takes into account the time limit of the data (only to access the information of the most recent event segment), then consider splitting by time.

The following illustration shows the entire schema after the Shard:

When dealing with different application scenarios, it is also necessary to take into account the limitations of the two methods, as well as their respective strengths, and use different combinations at different times (load pressures).

The advantages of joint segmentation

You can make full use of vertical segmentation and horizontal segmentation of their respective advantages and avoid their own defects

U Maximize System Scalability

The disadvantage of joint segmentation

U database system architecture is more complicated and more difficult to maintain

U application architecture is also relatively more complex

split implementation strategy and sample demo

Part I: Implementation strategy

1. The preparatory phase

Before the database is partitioned (sharding), the System business logic and database schema need to be fully understood. To draw a database ER diagram, to divide the Shard on the basis of graph, to ensure the sober thinking.

2. Analysis phase

1. Vertical segmentation

Vertical segmentation is based on the principle of: Close the business, the tables are closely related to each other, such as the table of the same module. In conjunction with the prepared database ER or domain model diagram, follow the swim lane concept in the activity diagram, a lane represents a shard, dividing all tables into different lanes. The following analysis example shows this practice. This way the table structure between multiple databases is different.

2. Horizontal segmentation

After vertical segmentation, the data volume and the growth rate of the table in Shard need to be analyzed further to determine whether horizontal segmentation is required. The table structures in these databases are exactly the same.

2.1 If the table data is divided into a slow growth, after the product on-line can meet long enough time can be hosted by a single database, then do not need horizontal segmentation, all tables reside in the same shard, all the relations between the tables will be maximum reservation, while ensuring the freedom of writing SQL, It is not easy to be restricted by clauses such as join, group BY, order by, and so on.

2.2 If the table is divided into a large amount of data, rapid growth, the need for further horizontal segmentation. Further horizontal segmentation is done in this way:

2.2.1 Combines business logic with the relationships between tables, dividing the current shard into smaller shard, typically, these smaller shard each contains only one primary table (a table that will hash with that table ID) and multiple secondary tables associated with or indirectly associated with it. The condition of a shard table with multiple sheets is the inevitable result of horizontal segmentation. In this way, the number of Shard will increase rapidly. If each shard represents a separate database, then managing and maintaining the database will be very cumbersome, and these small shard often have only two or three tables, to create a new library, the utilization rate is not high, so after the horizontal segmentation is completed, you can do a "reverse merge", that is: the business close to the , and two or more shard with similar data growth rates (the amount of primary table data on the same order of magnitude) are placed on the same database, logically they remain separate shard, have their respective primary tables, and are hashed according to the IDs of their respective primary tables. The difference is that their hash modulo (that is, the number of nodes) must be consistent. In this way, the number of tables on each database node is relatively average.

2.2.2 All tables are partitioned to the appropriate shard, all of the table associations across Shard must be interrupted, and when writing SQL, joins, group by, and order times across Shard are banned, and these issues need to be resolved at the application level.

3. Implementation phase

If the project at the beginning of the development of the decision to carry out the sub-Library, then strictly follow the analysis of the design plan to advance. If it is implemented in the medium-term architecture evolution, in addition to building the infrastructure to implement sharding logic, it is necessary to filter the original SQL and modify the SQL that is affected by the sharding.

Part Two: Sample demo

The following uses Jpetstore (the Pet Shop's E-commerce system) to demonstrate how to do the sharding in the analysis phase. Jpetstore from the original Ibatis of the official demo version, SVN address is:

Http://mybatis.googlecode.com/svn/tags/java_release_2.3.4-726/jpetstore-5


As the system is simpler, it is easy to see from the model that it consists mainly of three modules: users, products and orders. Then the vertical segmentation scheme is out. Next look at the horizontal segmentation, if we consider from a real pet store, the possibility of a single table of data explosion should be account and order, so the two tables need to be horizontal segmentation. For the product module, if it is a real system, the number of product and Item is not very large, so only vertical segmentation is sufficient, that is, (Product,category,item,iventory, Supplier) Five tables on a database node (no horizontal segmentation, no more than two database nodes). But as a demo, we assume that the product module also has a lot of data that requires us to do horizontal segmentation, so the analysis shows that the module will split two shard: one Is (Product (main), Category), the other is (Item (main), Iventory, Supplier), at the same time, we think: These two shard in the data growth should be similar, and very close to the business, then we can put these two shard on the same database node, item and product data in the hash of the same model. According to the drawing method described in the previous article, we get the following sharding schematic diagram:


A few more points for this picture:

1. Use lanes to represent physical shard (a database node)

2. If the vertical cut out of the Shard for further horizontal segmentation, but the common one physical shard, then use the dotted box to indicate that it is logically an independent shard.

3. The dark entity represents the primary table

4. x indicates the need to interrupt the Inter-table Association Three, global primary key generation policy

Once the database is split into multiple physical nodes, we can no longer rely on the primary key generation mechanism of the database itself. On the one hand, the ID generated by a partitioned database is not guaranteed to be unique globally, and the application needs to obtain an ID before inserting the data for SQL routing.

The Flickr development team wrote in 2010 about a primary key generation measurement strategy used by Flickr, and said that the program's actual performance on Flickr was also very satisfying, and that it was somewhat similar to the general sequence table scenario, but it was a good solution to the bottleneck and the single point of the problem. is a very reliable and efficient global primary key generation scenario.


The whole idea of Flickr is to build more than two database ID generation servers, each with a sequence table that records the current IDs of each table, but the number of IDs in sequence is the server, and the starting values are staggered, This is equivalent to hashing the generation of IDs onto each server node. For example, if we set up two database ID generation server, then let one of the sequence table ID starting value is 1, each growth step is 2, the other sequence table ID starting value of 2, each growth step is also 2, The result is that the odd ID will be generated from the first server, even IDs are generated from the second server, so that the pressure on the build ID is distributed evenly across the two servers, with the control of the application, and when a server fails, the system can automatically switch to another server to get the ID, Thus the system's fault tolerance is ensured.

Here are a few details about this program:

1. Flickr's database ID generation server is a dedicated server with only one database on the server, and tables in the database are used to generate sequence. This is also because the two database variables Auto-increment-offset and auto-increment-increment are variables at the database instance level.

2. The stub field in the table in the Flickr scenario is just a char (1) Not null stub field, not a table name, so in general, a sequence table has only one record and can generate IDs for multiple tables at the same time, if the ID of the table needs to be contiguous, You need to create a separate sequence table for the table.

3. The program uses the MySQL last_insert_id () function, which also determines that the sequence table can only have one record.

4. Inserting data using replace into, this is a very flattering approach, mainly to use MySQL's own mechanism to generate IDs, not only because it is simple, but also because we need IDs in the way we set (initial and step) to generate.

5. SELECT last_insert_id () must be in the replace into statement under the same database connection to get the new ID just inserted, otherwise the value returned is always 0

6. The sequence table in the programme uses the MyISAM engine, for higher performance, note: The MyISAM engine uses a table-level lock, and MyISAM reads and writes to the table serially, so you don't have to worry about getting the same ID for two reads at a time (plus, you should not need to sync the program) , each requested thread will get a new connection, and there is no shared resource that needs to be synchronized. After the actual comparison test, using the same sequence table for ID generation, MyISAM engine than InnoDB performance much higher.

7. The operation of the sequence table can be achieved with pure JDBC for greater efficiency, and experiments show that even using spring JDBC performance is not as fast as pure jdbc.

To implement this scenario, the application will also need to do some processing, mainly two aspects of the work:

1. Automatic equalization Database ID generation server access

2. Ensure that when a database ID generation server is invalidated, the request can be forwarded to another server for execution. Iv. Implementation level of sharding

Through the previous chapters, we have made it clear that data segmentation through database can greatly improve the scalability of the system. However, when the data in the database is stored in a different database host after vertical and/or horizontal segmentation, the biggest problem that the application system faces is how to make these data sources better integrated.

Implemented through proxy between application server and database

A proxy is added between the application server and the database, and the application sends a data request to the data through the proxy, and the agent resolves the SQL to the target shard based on the configured routing rules, because it is completely transparent to the application, So it has become a choice of many sharding products. In this regard more well-known products are the official MySQL agent tool: MySQL proxy and a Chinese development of products: amoeba. The MySQL proxy itself does not implement any sharding logic, but as a proxy for the MySQL database, providing developers with a place to embed sharding logic, using LUA as the programming language, which is a problem for many teams to consider. Amoeba is a dedicated proxy product for read-write separation and sharding, which is simple to use without any programming language, and only needs to be configured through XML. However, amoeba does not support transactions (transaction information is erased when a request for transaction information from an application arrives at amoeba, so even a single point of data access will not have a transaction) has been a mishap. Of course, this depends on the positioning of the product and design concept, we can only say that for those who require a very high system, amoeba is not suitable.

Amoeba for MySQL is primarily dedicated to MySQL database solutions, front-end application-requested protocols and backend connections to the data source database must be MySQL. For any application of the client, amoeba for MySQL and a MySQL does not make any difference, any use of the MySQL protocol client requests, can be amoeba for MySQL parsing and processing, the following figure can tell us amoeba for Schema information for MySQL:

Amoeba Use guide:

Http://docs.hexnova.com/amoeba/index.html

Some of the other implementation layers can refer to:

http://blog.csdn.NET/bluishglc/article/details/7766508

transaction processing for multiple data sources

six or one sharding expansion schemes that support free planning without data migration and modification of routing code

Sharding expansion--the system maintenance cannot bear the heavy

Any sharding system, after running for a period of time, the data will accumulate to the current node size can be loaded with the upper limit, this time need to expand the database, that is, add new physical nodes to share the data. If the system uses an ID hash based routing, the team needs to recalculate the target shard for all data based on the new node size and migrate it over, which is a huge maintenance burden for the team, and if the system is routed in incremental intervals ( If every 10 million data or every one months of data is stored on one node, although data migration can be avoided, it is possible to bring a "hot spot" problem, that is, the recent system of reading and writing are concentrated on the newly created node (many systems have such characteristics: the new data read and write frequency is significantly higher than the old data), Thus affecting the system performance. Faced with this dilemma, sharding expansion appears unusually difficult.

In general, the "ideal" expansion plan should endeavour to meet the following requirements:

1. It is best not to migrate the data (in any case, data migration is a problem that makes the team pressure mountain big)

2. Allows free planning of capacity expansion and node storage load based on hardware resources

3. Can evenly distributed data read and write, avoid "hot" problem

4. Ensure that nodes that have reached the storage limit are no longer written to data

At present, there are not many excellent schemes to avoid data migration, there are two relatively feasible, one is to maintain a record data ID and target shard mapping of the corresponding relationship, write, the data are written into the new expansion of the Shard

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.