Common strategies for database sharding in Distributed Databases

Source: Internet
Author: User
Tags database sharding

In a large-capacity, high-load web system, splitting the database can effectively improve its capacity and performance. In the early days of a project, programmers usually design a single database with a single table for each kind of data, following the traditional database design approach. Once the data volume and concurrency reach a certain level, serious performance and maintenance problems may occur, and it is very painful to start optimizing only after a problem appears. We should therefore anticipate these problems when the system is first designed.

Some current setups keep a single database structure and replicate it synchronously to several servers for read/write splitting. I personally find this strategy clumsy: the data still has to be split, otherwise the memory of each machine is easily overloaded.

Generally, only tables with a large amount of data need to be split. An important table that requires maintenance, such as the article directory table, should also be split out if data may be dumped into it from other systems. Otherwise, the directory table may be damaged accidentally during a data dump when the backup has been forgotten.

The sections below analyze several splitting structures:

I. Time Structure

If the business system is highly time-sensitive, such as the article table of a news publishing system, you can split the tables by time. There are several time-based structures:

1) Flat panel type

The tables look like this:
Article_200901
Article_200902
Article_200903

You can split the tables by year or by month. Splitting by day is also possible, but it produces too many tables and is rarely necessary. Splitting by month is generally recommended.

The difficulty of this method is the following. Suppose I want to list 20 records and the three most recent tables hold only two matching rows each; the application then has to read all three tables, and probably more. If the time span is long, there are dozens of tables, and if each of them holds zero matching rows, the query ends up reading every table in the system. Paging is also difficult to implement with this structure.
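The cost described above can be seen in a minimal sketch. It assumes SQLite, a hypothetical two-column article table, and helper names (`month_tables`, `latest_articles`) that are not part of the original design. The query walks the monthly tables from newest to oldest and stops as soon as 20 rows have been collected, so sparse tables force it to read further and further back.

```python
import sqlite3

def month_tables(newest, count):
    """Yield `count` table names, Article_YYYYMM, walking back from `newest`."""
    year, month = newest
    for _ in range(count):
        yield f"Article_{year:04d}{month:02d}"
        month -= 1
        if month == 0:
            year, month = year - 1, 12

def latest_articles(conn, newest, max_tables, limit=20):
    """Collect up to `limit` rows, newest first, reading as few tables as possible."""
    rows = []
    for table in month_tables(newest, max_tables):
        remaining = limit - len(rows)
        if remaining <= 0:
            break
        # Table names cannot be bound as parameters; they only come from month_tables.
        rows += conn.execute(
            f"SELECT id, title FROM {table} ORDER BY id DESC LIMIT ?", (remaining,)
        ).fetchall()
    return rows

# Demo data: three monthly tables, two of them nearly empty (an assumption).
conn = sqlite3.connect(":memory:")
for name in month_tables((2009, 3), 3):
    conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO Article_200903 VALUES (1236000000000, 'march')")
conn.execute("INSERT INTO Article_200901 VALUES (1231000000000, 'january')")
print(latest_articles(conn, (2009, 3), 3))
```

With only two matching rows in total, the loop has to touch all three tables and still returns fewer than 20 rows, which is exactly the situation the paragraph warns about.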

Primary Key: In this system, the primary key is a 13-digit timestamp in milliseconds. Do not use an auto-increment ID; otherwise it is difficult to locate the table from the primary key. You can also pass a time value along with each query, but that is cumbersome.
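As a rough illustration of locating the table from the primary key, the sketch below maps a 13-digit millisecond timestamp to a monthly table name. The function names and the choice of UTC for the month boundary are assumptions; the original does not say which time zone is used.

```python
import time
from datetime import datetime, timezone

def new_article_id() -> int:
    """Return the current time as a 13-digit millisecond timestamp.

    Two inserts in the same millisecond would collide; a real system
    needs a tie-breaker, which is outside the scope of this sketch.
    """
    return time.time_ns() // 1_000_000

def table_for_id(article_id: int) -> str:
    """Map a millisecond-timestamp ID to its monthly table (UTC assumed)."""
    moment = datetime.fromtimestamp(article_id // 1000, tz=timezone.utc)
    return f"Article_{moment:%Y%m}"

print(table_for_id(1233446400000))  # 2009-02-01 00:00:00 UTC
```

Because the table name is derived from the ID alone, a lookup by primary key never needs to scan more than one table.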

2) Archive type

The tables look like this:
Article_old
Article_new

To overcome the disadvantages of the flat panel type, we can adopt a time-based archive design. This system has only two tables: one for old articles and one for new articles. The new article table holds the most recent two months; every day, a scheduled job moves articles older than two months into the old table. This solves the performance problem, because a news publishing system mostly reads new content and rarely reads old content. It also neatly solves the functional problem described for the flat panel type: in archive mode, a query reads at most two tables.
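The daily archive job could look roughly like the sketch below. It assumes SQLite, millisecond-timestamp primary keys as described earlier, and "two months" approximated as 60 days; the column layout and the function name are illustrative only.

```python
import sqlite3

TWO_MONTHS_MS = 60 * 24 * 60 * 60 * 1000  # assumption: two months taken as 60 days

def archive_old_articles(conn, now_ms):
    """Move rows older than the cutoff from Article_new to Article_old."""
    cutoff = now_ms - TWO_MONTHS_MS
    with conn:  # one transaction: the copy and the delete succeed or fail together
        conn.execute(
            "INSERT INTO Article_old SELECT * FROM Article_new WHERE id < ?", (cutoff,)
        )
        moved = conn.execute(
            "DELETE FROM Article_new WHERE id < ?", (cutoff,)
        ).rowcount
    return moved

conn = sqlite3.connect(":memory:")
for name in ("Article_new", "Article_old"):
    conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, title TEXT)")
now_ms = 1_240_000_000_000
conn.execute("INSERT INTO Article_new VALUES (?, 'fresh')", (now_ms - 1000,))
conn.execute("INSERT INTO Article_new VALUES (?, 'stale')", (now_ms - TWO_MONTHS_MS - 1,))
moved = archive_old_articles(conn, now_ms)
print(moved)
```

Running the job once a day keeps Article_new small, so the common "latest articles" query stays fast while older rows remain reachable in Article_old.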

The disadvantage of the archive type is that the old table grows relatively large. If the business permits, you can archive the content of the old table elsewhere or simply delete it.

II. Forum Section Structure

If you split the tables by the section an article belongs to, such as the news section or the sports section, the data volume is spread across tables and the sections affect one another as little as possible. If the tables of the news section are damaged or need maintenance, the sports section keeps running normally, which reduces risk. The forum section structure is also commonly used in systems such as BBS.

The forum section structure also has several variants:

1) Corresponding type

This type suits systems in which the number of sections is small and relatively fixed. For example, the news section can be split into a news category table and a news article table.

News_category
News_article
Sports_category
Sports_article

Each section corresponds to a group of tables with an identical structure, which is easy to understand. Functionally, because the sections are largely independent of one another, joint queries across sections are rarely needed, and development is easier than with the time structure.

Primary Key: This still needs some thought. In this system, the primary key is the section plus a timestamp; a plain timestamp or an auto-increment ID can also be used. Remember to include the section in every query so that the table can be located.
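A small sketch of how the section in the key can locate the table is given below. The whitelist, the `section-timestamp` key format, and the function names are assumptions made for illustration; the original only states that the key combines the section and a timestamp.

```python
# Hypothetical whitelist: each known section owns an identical pair of tables.
SECTIONS = {"news": "News", "sports": "Sports"}

def tables_for(section):
    """Return the (category table, article table) pair for a section."""
    prefix = SECTIONS[section]  # raises KeyError for an unknown section
    return f"{prefix}_category", f"{prefix}_article"

def make_key(section, ts_ms):
    """Build a primary key of the form section-timestamp (assumed format)."""
    return f"{section}-{ts_ms}"

def table_for_key(key):
    """Recover the article table from a key produced by make_key."""
    section, _ = key.split("-", 1)
    return tables_for(section)[1]

print(table_for_key(make_key("sports", 1236000000000)))
```

Looking sections up in a fixed whitelist also keeps arbitrary user input out of table names.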

2) Hot and cold type

The downside of the corresponding type is that if the number of sections is large and cannot be known in advance, too many tables have to be created. Take Baidu Tieba as an example: if every forum got its own table, how many tables would you need?

In that case, use this method:

Tieba_automobile
Tieba_aircraft
Tieba_rocket
Tieba_unite

Here, popular sections such as automobile and rocket have their own tables. A newly created section is stored in the unite table by default, and it gets its own table only once it has more than 10,000 main posts. In such a system there are far more unpopular sections than popular ones. Unpopular sections usually hold only a few posts, so giving each of them a table would be wasteful; at the same time, the post count and traffic of popular sections are far larger than those of unpopular sections. This contrast is the defining characteristic of such a system.
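The routing rule above can be sketched as follows. The class name, the in-memory set of hot sections, and the promotion method are hypothetical; a real system would also have to create the new table and migrate the section's rows out of the unite table when it is promoted.

```python
HOT_THRESHOLD = 10_000  # main posts required before a section gets its own table

class SectionRouter:
    """Route a section to its own table if hot, otherwise to Tieba_unite."""

    def __init__(self):
        # Assumed initial hot sections, taken from the table list above.
        self.hot = {"automobile", "aircraft", "rocket"}

    def table_for(self, section):
        # Section names must be validated before being used in a table name.
        return f"Tieba_{section}" if section in self.hot else "Tieba_unite"

    def maybe_promote(self, section, main_post_count):
        """Mark a cold section as hot once it passes the threshold."""
        if section not in self.hot and main_post_count > HOT_THRESHOLD:
            self.hot.add(section)  # table creation and row migration not shown
            return True
        return False

router = SectionRouter()
print(router.table_for("rocket"), router.table_for("gardening"))
```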

The unite table can also be extended into a set of hash tables. Using the MD5 hash of the section name, it can be divided into n tables. As a quick calculation, the first hexadecimal digit of the MD5 gives 16 tables, and the first two digits give 256 tables.

Tieba_unite_ab
Tieba_unite_ac
...
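A minimal sketch of the MD5-prefix naming, using Python's standard `hashlib`. The function name and the choice of UTF-8 for encoding the section name are assumptions.

```python
import hashlib

def unite_table(section, digits=2):
    """Pick a unite sub-table from the first `digits` hex digits of the MD5."""
    digest = hashlib.md5(section.encode("utf-8")).hexdigest()
    return f"Tieba_unite_{digest[:digits]}"

# Each hex digit has 16 possible values, so `digits` digits give 16**digits tables.
print(unite_table("hello"), 16 ** 1, 16 ** 2)
```

MD5 is used here only to spread names evenly across tables, not for security, so its known cryptographic weaknesses do not matter for this purpose.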

III. Hash Structure

The hash structure is usually used for user-based systems such as blogs. A blog system has several features: 1. there are a large number of users; 2. each user publishes only a small number of articles; 3. users publish at irregular intervals; 4. although each user publishes little, the total number of articles is still very large. Given these features, none of the splitting methods above is suitable. The content has no fixed period of relevance, so splitting by time does not fit. There are many users, almost all of them "cold", so splitting by section (that is, by user) does not fit either.

The hash structure was mentioned above. Since giving every user a dedicated table is impractical, it is better to place a group of users in one table.

Blog_aa
Blog_ab
Blog_ac
...

As mentioned above, the first two hexadecimal digits of the MD5 give 256 tables. If that is not enough, add a third digit for 4,096 tables. (A base-36 suffix drawn from 0-9 and a-z would give 1,296 tables with two characters and 46,656 with three.) Isn't that enough?

With so many tables, creating them all in advance is tedious. Instead, before each insert the program can execute a statement that checks whether the table exists and creates it if it does not. This is very practical and costs little.
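The check-then-create step can be sketched with `CREATE TABLE IF NOT EXISTS`, which SQLite, MySQL, and PostgreSQL all support. The example below uses SQLite; the column layout and function names are assumptions.

```python
import hashlib
import sqlite3

def blog_table(user):
    """Pick the table for a user from the first two hex digits of the MD5."""
    return "Blog_" + hashlib.md5(user.encode("utf-8")).hexdigest()[:2]

def insert_post(conn, user, post_id, body):
    """Create the user's table on first use, then insert the post."""
    table = blog_table(user)  # suffix is hex digits only, safe to interpolate
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(id INTEGER PRIMARY KEY, author TEXT, body TEXT)"
    )
    conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?)", (post_id, user, body))
    return table

conn = sqlite3.connect(":memory:")
first = insert_post(conn, "hello", 1, "first post")
second = insert_post(conn, "hello", 2, "second post")
print(first, second)
```

The second call finds the table already present, so the `CREATE` statement is a cheap no-op.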

Primary Key: This still needs some thought. In this system, the primary key is the user ID plus a timestamp; a plain timestamp or an auto-increment ID can also be used. However, remember to include the user name in every query so that the table can be located.

IV. Summary and Sub-table Structure

Beyond the structures above, many more can be devised to suit a particular business system. However, Internet businesses are becoming more and more complex, and sometimes a single splitting method cannot meet the requirements, so several splitting schemes have to be combined. The logic then becomes confusing. I once developed a system that mixed only the hash structure and the time structure, and even that logic was quite complicated.

Therefore, in addition to splitting tables, it is very advantageous to also keep a summary table built on the original single-database, single-table architecture. In this architecture, every write goes to the database twice. Reads mainly go to the split tables for performance. The summary table implements the functions that are hard to implement after splitting, and it is used for the daily scheduled backup. In addition, the summary table and the sub-tables are complete backups of each other: if a sub-table is damaged or its data is abnormal, the correct data can be read from the summary table and restored, and vice versa.
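The double write and the restore path can be sketched as below. Two in-memory SQLite connections stand in for the separate summary and sub-table databases, and the table name `Blog_all`, the two-way `shard_for` split, and the function names are all hypothetical. The two writes here are not atomic across the two stores; a real system would need to handle a failure between them.

```python
import sqlite3

SCHEMA = "(id INTEGER PRIMARY KEY, author TEXT, body TEXT)"

def shard_for(author):
    """Hypothetical two-way split, kept trivial for the demo."""
    return f"Blog_{sum(author.encode('utf-8')) % 2}"

def write_post(summary, shards, post_id, author, body):
    """Write every row twice: once to the summary table, once to its sub-table."""
    row = (post_id, author, body)
    summary.execute("INSERT INTO Blog_all VALUES (?, ?, ?)", row)
    shards.execute(f"INSERT INTO {shard_for(author)} VALUES (?, ?, ?)", row)

def restore_shard(summary, shards, table):
    """Rebuild one damaged sub-table from the summary table."""
    shards.execute(f"DELETE FROM {table}")
    for row in summary.execute("SELECT id, author, body FROM Blog_all"):
        if shard_for(row[1]) == table:
            shards.execute(f"INSERT INTO {table} VALUES (?, ?, ?)", row)

summary = sqlite3.connect(":memory:")
shards = sqlite3.connect(":memory:")
summary.execute(f"CREATE TABLE Blog_all {SCHEMA}")
for n in range(2):
    shards.execute(f"CREATE TABLE Blog_{n} {SCHEMA}")
write_post(summary, shards, 1, "alice", "first")
write_post(summary, shards, 2, "bob", "second")
```

To see the restore path, empty the sub-table returned by `shard_for("alice")` and call `restore_shard` on it; the row reappears from the summary copy.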

In this summary-plus-sub-table structure, the performance and maintainability of the summary table are a concern. My solution is to run the summary table on relatively stable software and architecture, such as Oracle or LVS + pgpool + PostgreSQL, to keep the data safe, while the sub-tables run on lightweight MySQL with a focus on speed. Being able to use different software and solutions for the summary table and the sub-tables is also a major feature of this structure.



Summary: How best to optimize a system by splitting tables depends, above all, on analyzing the needs and characteristics of the business. This article only provides several basic methods. Think through your specific workload first and do not rush; choosing the wrong method can multiply the workload tenfold.
