How to build a website with tens of millions of daily page views (III): sharding

Last Update:2014-07-16 Source: Internet

Author: User


In fact, to cope with increasingly complex business scenarios, many large websites use a divide-and-conquer approach and split the whole site's business into separate product lines. Large domestic e-commerce sites, for example, split their home page, shops, orders, buyers, sellers and so on into different product lines, each owned by a different business team.

On the technical side, the site is likewise divided into many applications along product lines, and each application is deployed and maintained independently. Applications can be connected through hyperlinks (each navigation link on the home page points to a different application's address), through message queues, or, of course, by accessing the same data storage system, so that together they form one complete, interrelated system. The architecture is shown in the following figure:

Distributed services: as the business is split ever more finely and the storage systems grow ever larger, the overall complexity of the application system rises exponentially, and deployment and maintenance become harder. Because every application has to connect to every database system, on a site with tens of thousands of servers the number of connections grows with the square of the server count, which exhausts database connection resources and leads to denial of service.

Since every application needs to perform many of the same business operations, such as user management and commodity management, these shared operations can be extracted and deployed independently. These reusable services connect to the databases and expose shared business services, while the application systems only manage the user interface and carry out specific business operations by invoking the shared services through distributed service calls.
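The effect on connection counts can be illustrated with simple arithmetic. The sketch below is only an illustration: the server counts and the function name `db_connections` are assumptions made up for the example, not figures from the article.

```python
def db_connections(client_servers: int, db_servers: int, pool_size: int = 1) -> int:
    """Database connections when every client server connects to every database server."""
    return client_servers * db_servers * pool_size


# Hypothetical sizes: 1000 application servers, 100 database servers,
# and 50 shared-service servers placed in front of the databases.
direct = db_connections(client_servers=1000, db_servers=100)
via_services = db_connections(client_servers=50, db_servers=100)

print(direct, via_services)  # prints "100000 5000"
```

With the shared-service layer in place, only the service servers hold database connections, so the count depends on the number of service servers rather than on the number of applications.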

How is the database sharded?

Data segmentation (sharding) can be divided into two modes according to the type of segmentation rule. One is to split by table (or schema) into different databases (hosts), which is called vertical segmentation of data. The other is to split the data of a single table across multiple databases (hosts) according to some condition based on the logical relationships of the data in the table, which is called horizontal segmentation of data.

The biggest feature of vertical segmentation is that the rules are simple and the implementation is convenient. It is especially suitable for systems in which coupling between business areas is very low, mutual influence is small, and business logic is very clear. In such a system it is easy to split the tables used by different business modules into different databases. Because the split is made by table, the impact on the application is small and the split rules are simple and clear.

Horizontal segmentation is somewhat more complex than vertical segmentation. Because different rows of the same table are split into different databases, the split rule is more complex for the application than a split by table name, and later data maintenance is more complex as well.

When one (or some) of our tables has a particularly large data volume and access load, and placing it on a separate host through vertical segmentation still cannot meet performance requirements, we have to combine vertical and horizontal segmentation: first split vertically, then horizontally, to solve the performance problem of such a very large table.

Below we analyze how the three segmentation methods (vertical, horizontal and combined) are implemented, and how data is integrated after segmentation.

Vertical segmentation of data

Let's first look at what kind of segmentation vertical segmentation of data is.
Vertical segmentation of data can also be called longitudinal segmentation. Think of the database as being made up of many large "data blocks" (tables): we cut these blocks apart vertically and scatter them across multiple database hosts. That is vertical (longitudinal) segmentation of data.

In an application system with a well-designed architecture, the overall functionality is made up of many functional modules, and the data each module needs corresponds to one or more tables in the database. In architecture design, the more unified and the fewer the interaction points between functional modules, the lower the coupling of the system and the better the maintainability and extensibility of each module. Such a system is also easier to split vertically. The clearer the functional modules and the lower the coupling, the easier it is to define the rules for vertical segmentation. The data can be split entirely by functional module, with the data of different modules stored on different database hosts. This easily avoids cross-database joins and keeps the system architecture very clear.

Of course, it is very hard for a system to keep the tables of every functional module completely independent, with no module needing to access another module's tables or to join tables from two modules. In that case we have to weigh the trade-offs against the actual application scenario. Either the application's needs are accommodated and the tables involved in the join are stored in the same database, or the application does more work: the program obtains data from the different databases entirely through module interfaces and then completes the join itself.

Generally speaking, if the load is not very high and the table joins are very frequent, it may be feasible for the database to give way and merge several related modules, which saves the application a good deal of work. Of course, letting multiple modules share one data source in this way tacitly accepts increased coupling between the modules, and may make the architecture worse and worse over time. In particular, when the system has grown to the point where the database can no longer bear the load of these tables and a re-split becomes unavoidable, the cost of changing the architecture may be far greater than it would have been at the start.

Therefore, when splitting a database vertically, how to split and how far to go is a question that really tests one's judgment. Only by balancing the costs and benefits of every aspect in the actual scenario can you arrive at a split plan that truly suits you.

For example, for the example database of the sample system used in this book, we analyze it briefly and then design a simple segmentation rule for a vertical split. The system's functions can basically be divided into four functional modules: users, group messages, albums and events, which correspond to the following tables:

1. User module tables: user, user_profile, user_group, user_photo_album

2. Group discussion tables: groups, group_message, group_message_content, top_message

3. Album tables: photo, photo_album, photo_album_relation, photo_comment

4. Event information table:

At first glance, no module can exist apart from the others: the modules are all related to one another. Does that mean they cannot be split? Of course not. Looking a little deeper, we find that although the tables used by the modules are related, the relationships are clear and simple. The group discussion module and the user module are related mainly through user or group relationships. These associations are generally made through the user's id or nick_name and the group's id, so handling them through interfaces between the modules does not cause much trouble. The album module is associated only with the user module, through the user. The association between these two modules is basically through the user id, so it is simple and the interface is clear. The event module may be related to every other module, but it only cares about the ID information of the objects in each module, so it can also be split off easily.

Therefore, our first step can be to split the database vertically by functional module: the tables of each module go into their own database, and table associations between modules are handled on the application side through interfaces, as shown in the following figure:

After this vertical split, a service that could previously be provided only by a single database is provided by four databases, and the service capacity naturally increases several times over.
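A vertical split rule of this kind amounts to a lookup from table name to database. The following is a minimal sketch of such a lookup, using the table names listed above. The database names (`user_db`, `group_db`, `album_db`) and the function `database_for_table` are hypothetical, and the event module is left out because its table is not named here.

```python
# Module -> tables, taken from the table list in the article.
MODULE_TABLES = {
    "user": ["user", "user_profile", "user_group", "user_photo_album"],
    "group": ["groups", "group_message", "group_message_content", "top_message"],
    "album": ["photo", "photo_album", "photo_album_relation", "photo_comment"],
}

# Hypothetical database names, one database per functional module.
MODULE_DATABASE = {"user": "user_db", "group": "group_db", "album": "album_db"}

# Flatten into a table -> database lookup.
TABLE_DATABASE = {
    table: MODULE_DATABASE[module]
    for module, tables in MODULE_TABLES.items()
    for table in tables
}


def database_for_table(table: str) -> str:
    """Return the database that holds the given table under the vertical split."""
    try:
        return TABLE_DATABASE[table]
    except KeyError:
        raise KeyError(f"no vertical-split rule for table {table!r}") from None


print(database_for_table("photo_comment"))  # prints "album_db"
```

Because the rule depends only on the table name, the application needs no knowledge of the data inside the table to find the right database.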

Advantages of vertical slicing

1. The database split is simple and clear;

2. Application modules are clearly delimited and easy to integrate;

3. Data maintenance is convenient and data is easy to locate.

Disadvantages of vertical slicing

1. Some table joins cannot be completed at the database level and have to be completed in the program;

2. Tables that are accessed extremely frequently and hold very large amounts of data still have a performance bottleneck, so requirements are not necessarily met;

3. Transaction processing is relatively more complex;

4. Once the split reaches a certain point, scaling runs into limits;

5. Excessive splitting can make the system overly complex and hard to maintain.

For the data association and transaction problems that vertical segmentation may cause, it is difficult to find a good solution at the database level. In practice, the vertical split of the database mostly follows the modules of the application system: the data of one module is stored in one database, which solves data association inside the module. Between modules, applications provide the required data to each other through service interfaces. Although this does increase the total number of operations against the database, it is beneficial for the overall scalability of the system and the modularity of the architecture. The response time of some individual operations may rise slightly, but the overall performance of the system is likely to improve somewhat. The scaling bottleneck, however, can only be overcome with the horizontal segmentation architecture introduced in the next section.
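The application-side join described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's implementation: two in-memory SQLite databases stand in for the user and group databases, the columns of `user` and `group_message` are invented for the example, and `get_nick_names` plays the role of the user module's service interface.

```python
import sqlite3

# Two in-memory SQLite databases stand in for two vertically split databases.
user_db = sqlite3.connect(":memory:")
group_db = sqlite3.connect(":memory:")

# Column layouts are assumptions made for this example.
user_db.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, nick_name TEXT)")
user_db.executemany("INSERT INTO user VALUES (?, ?)", [(1, "alice"), (2, "bob")])

group_db.execute(
    "CREATE TABLE group_message (id INTEGER PRIMARY KEY, user_id INTEGER, subject TEXT)"
)
group_db.executemany(
    "INSERT INTO group_message VALUES (?, ?, ?)",
    [(10, 1, "hello"), (11, 2, "sharding"), (12, 1, "joins")],
)


def get_nick_names(user_ids):
    """Stand-in for the user module's service interface: id -> nick_name."""
    ids = sorted(set(user_ids))
    if not ids:
        return {}
    placeholders = ",".join("?" * len(ids))
    rows = user_db.execute(
        f"SELECT id, nick_name FROM user WHERE id IN ({placeholders})", ids
    ).fetchall()
    return dict(rows)


def messages_with_authors():
    """Join group messages with user names in the program, not in the database."""
    messages = group_db.execute(
        "SELECT id, user_id, subject FROM group_message ORDER BY id"
    ).fetchall()
    names = get_nick_names(user_id for _, user_id, _ in messages)
    return [(msg_id, names.get(user_id), subject) for msg_id, user_id, subject in messages]


for row in messages_with_authors():
    print(row)
```

The join costs two queries instead of one, which matches the trade-off described above: more database operations overall in exchange for modules that only depend on each other's interfaces.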

Horizontal segmentation of data

The previous section analyzed vertical segmentation of data; this section analyzes horizontal segmentation. Vertical segmentation can be understood simply as splitting data by table or module, whereas horizontal segmentation no longer splits by table or functional module. Generally speaking, simple horizontal segmentation spreads a very heavily accessed table across multiple tables according to some rule on one of its fields, with each table holding a subset of the data.

Put simply, horizontal segmentation can be understood as splitting by data row: some rows of a table go to one database and other rows go to other databases. To make it easy to determine which database a row has been placed in, the split always follows a specific rule, such as a numeric field taken modulo a specific number, the range of a time-type field, or the hash value of a character-type field. If most of the core tables in the system can be associated through a particular field, that field is naturally the first choice for horizontal partitioning, barring special cases in which it cannot be used.

Web 2.0 sites, which are very popular on the Internet today, are a typical case: most of their data can be associated through member (user) information, so many of the core tables are well suited to horizontal segmentation by member ID. Forum and community discussion systems are even easier to split: the data can simply be split horizontally by forum number, and after the split there is basically no interaction between the databases.

In our example system, all data is associated with the user, so we can split horizontally by user and place different users' data in different databases.
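The three kinds of rule mentioned above (modulo on a numeric field, range of a time field, hash of a character field) can each be written as a small function. The sketch below is illustrative only: the function names, the one-shard-per-year granularity, and the choice of CRC32 as the hash are all assumptions.

```python
import zlib
from datetime import date


def shard_by_modulo(numeric_id: int, shard_count: int) -> int:
    """Numeric field taken modulo a fixed number of shards."""
    return numeric_id % shard_count


def shard_by_time_range(day: date) -> str:
    """Range rule on a time field: one shard per calendar year (assumed granularity)."""
    return f"events_{day.year}"


def shard_by_hash(text_key: str, shard_count: int) -> int:
    """Hash of a character field.

    zlib.crc32 is used because it gives the same value in every process,
    unlike Python's built-in hash() for strings, which is randomized per run.
    """
    return zlib.crc32(text_key.encode("utf-8")) % shard_count


print(shard_by_modulo(7, 2))                    # prints 1
print(shard_by_time_range(date(2014, 7, 16)))   # prints "events_2014"
print(shard_by_hash("alice", 4))                # a shard index between 0 and 3
```

Whichever rule is used, it has to be deterministic, so that the same row is always looked up in the database it was written to.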
The one exception is the groups table in the user module, which is not directly related to the user, so groups cannot be split horizontally by user. A table like this can simply be taken out and placed in a separate database on its own. This is in fact the vertical segmentation method described in the previous section; the next section describes in more detail the joint method that uses vertical and horizontal segmentation together.

So, for our sample database, most tables can be split horizontally by user ID, with the data of different users stored in different databases. For example, take every user ID modulo 2 and store the data in two different databases according to the result. Every table associated with a user ID can be split this way. As a result, basically all the data related to a given user sits in the same database, and any joins that are needed remain very simple. The following figure shows horizontal segmentation more intuitively:
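The modulo-2 split can be tried out with a minimal sketch. Two in-memory SQLite databases stand in for the two shards; the column layouts of `user` and `photo_album` and all helper names are assumptions made for the example. Because both tables are routed by user ID, a user's rows land in the same shard and the join stays inside one database.

```python
import sqlite3

SHARD_COUNT = 2

# One in-memory SQLite database per shard; every shard has the same schema.
shards = [sqlite3.connect(":memory:") for _ in range(SHARD_COUNT)]
for db in shards:
    db.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, nick_name TEXT)")
    db.execute(
        "CREATE TABLE photo_album (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)"
    )


def shard_for(user_id: int) -> sqlite3.Connection:
    """Route by user ID modulo the number of shards."""
    return shards[user_id % SHARD_COUNT]


def add_user(user_id: int, nick_name: str) -> None:
    shard_for(user_id).execute("INSERT INTO user VALUES (?, ?)", (user_id, nick_name))


def add_album(album_id: int, user_id: int, title: str) -> None:
    # Albums are routed by the owner's user ID, not by their own ID.
    shard_for(user_id).execute(
        "INSERT INTO photo_album VALUES (?, ?, ?)", (album_id, user_id, title)
    )


def albums_of(user_id: int):
    """Join user and photo_album inside the single shard that owns this user."""
    sql = (
        "SELECT u.nick_name, a.title FROM user u "
        "JOIN photo_album a ON a.user_id = u.id "
        "WHERE u.id = ? ORDER BY a.id"
    )
    return shard_for(user_id).execute(sql, (user_id,)).fetchall()


add_user(1, "alice")
add_user(2, "bob")
add_album(100, 1, "holiday")
add_album(101, 2, "pets")
add_album(102, 1, "food")

print(albums_of(1))  # prints [('alice', 'holiday'), ('alice', 'food')]
```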

Advantages of horizontal slicing

1. Table joins can basically all be completed on the database side;

2. Tables with very large data volumes and high load no longer run into bottlenecks;

3. Changes to the overall architecture on the application side are relatively small, and transaction processing is relatively simple;

4. As long as the segmentation rules are well defined, scalability limits are rarely encountered.

Disadvantages of horizontal slicing

1. The segmentation rules are relatively complex, and it is hard to abstract a single rule that fits the whole database;

2. Later data maintenance becomes harder, and locating data manually is more difficult;

3. The modules of the application system are highly coupled, which may make later data migration and splitting difficult.

Joint use of vertical and horizontal segmentation

In the two sections above, we looked at how vertical and horizontal segmentation are implemented and at the resulting architectures, and analyzed the advantages and disadvantages of each. In practice, however, only systems with modest load and relatively simple business logic can solve their scalability problems with just one of the two methods. I am afraid that most systems, with slightly more complex business logic and higher load, cannot achieve good scalability through either method alone. They need to combine the two, using different segmentation methods in different scenarios.

In this section I combine the strengths and weaknesses of vertical and horizontal segmentation to further refine our overall architecture and further improve the system's scalability.

Generally speaking, it is hard to associate all the tables in a database through one (or a few) fields, so horizontal segmentation alone can hardly solve every problem. Vertical segmentation, in turn, solves only part of the problem: in very heavily loaded systems, even a single table may exceed what a single database host can bear. We must combine vertical and horizontal segmentation, making full use of the advantages of both while avoiding their shortcomings.

The load of every application system grows step by step. When a performance bottleneck first appears, most architects and DBAs choose to start with a vertical split of the data, because it costs the least and best fits the input-output ratio pursued at that stage. However, as the business keeps expanding and the system load keeps growing, after a period of stability the vertically split database cluster may be overwhelmed again and hit another performance bottleneck.
What should we choose at this point? Subdivide the modules further, or look for another solution? If we keep subdividing modules the way we did at the start, we may soon face the same problem we face today. Moreover, as modules are refined further and further, the application architecture becomes more and more complex, and the whole system may well get out of control.

At this point we have to draw on the advantages of horizontal segmentation. We do not need to discard the results of the earlier vertical split when introducing horizontal segmentation. Instead, we build on them, using the advantages of horizontal segmentation to avoid the drawbacks of vertical segmentation and to contain the growing complexity of the system. Meanwhile, the drawback of horizontal segmentation (rules that are hard to unify) has already been resolved by the earlier vertical split, which makes the horizontal split straightforward.

For our sample database, suppose we started with a vertical split, but as the business grew the database system hit a bottleneck and we chose to refactor the architecture of the database cluster. How should we refactor? The vertical split has already been done and the module structure is clear. The business is growing faster and faster, so even if we split the modules further now, it would not hold for long. We therefore choose to split horizontally on top of the vertical split. Each database cluster produced by the vertical split serves only one functional module, and all the tables in each functional module are basically associated through a single field.
For example, the user module can be split entirely by user ID, the group discussion module by group ID, and the album module by album ID. The event notification table, given the time sensitivity of its data (only events from a recent period are accessed), can be split by time. The following figure shows the whole architecture after the split:
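The combined rule can be sketched as a two-step router: the module selects the cluster (the vertical split), and the module's own key selects the shard inside it (the horizontal split). Everything concrete in the sketch is an assumption made for illustration: the shard counts, the database naming scheme, the monthly granularity for events, and the function name `route`.

```python
from datetime import date

# Hypothetical number of horizontal shards inside each module's cluster.
SHARDS_PER_MODULE = {"user": 4, "group": 2, "album": 4}


def route(module: str, key) -> str:
    """Return the database name for a module and its shard key.

    user  -> key is the user ID
    group -> key is the group ID
    album -> key is the album ID
    event -> key is the event date (split by month, an assumed granularity)
    """
    if module == "event":
        if not isinstance(key, date):
            raise TypeError("event module is split by time and needs a date key")
        return f"event_db_{key.year}{key.month:02d}"
    if module not in SHARDS_PER_MODULE:
        raise KeyError(f"unknown module {module!r}")
    return f"{module}_db_{key % SHARDS_PER_MODULE[module]}"


print(route("user", 10))                   # prints "user_db_2"
print(route("group", 7))                   # prints "group_db_1"
print(route("album", 9))                   # prints "album_db_1"
print(route("event", date(2014, 7, 16)))   # prints "event_db_201407"
```

Because each module only needs a rule for its own key, no single rule has to cover the whole database, which is the difficulty that horizontal segmentation has on its own.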

In fact, in many large application systems, vertical and horizontal segmentation basically coexist and are applied alternately to keep increasing the system's capacity to scale. When dealing with different scenarios, we need to take into account both the limitations and the advantages of the two methods, and use different combinations at different times (under different load pressures).

Benefits of Joint Segmentation

1. It makes full use of the respective advantages of vertical and horizontal segmentation while avoiding their shortcomings;

2. It maximizes the scalability of the system.

Disadvantages of Joint segmentation

1. The database system architecture is more complex and harder to maintain;

2. The application architecture is also relatively more complex.

For more information on database sharding, refer to the book "MySQL Performance Tuning and Architecture Design".

Transferred from: http://www.cnblogs.com/xiaocen/p/3736037.html

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
