A data sharding strategy based on ID characteristics

Source: Internet
Author: User
Tags comments database sharding advantage

As an application's business grows, the amount of data it handles grows with it, and sooner or later you have to scale the system. One typical approach is scaling up, which means improving the system's performance by using better hardware. The other approach, scaling out, achieves the same effect by adding more hardware, such as additional servers. In terms of hardware cost and system limits, scaling out is generally better than scaling up, so most systems that need to scale will consider scaling out to some extent. Because many system bottlenecks lie in data storage, a data architecture technique called database sharding emerged. This article discusses one typical way of implementing it.

Brief introduction

Data sharding means distributing the overall data across multiple storage devices (hereinafter called "data partitions" or simply "partitions") so that the amount of data on each device is small enough to meet the system's performance requirements. There are many sharding strategies; the following are common:

Based on an ID feature: for example, take the ID of a record modulo the number of partitions, and place the record on the partition whose number equals the result.

Based on a time range: for example, the first 1 million users' data is placed in the 1st partition, and the second 1 million users' data is placed in the 2nd partition.

Based on a lookup table: first look up the record's partition by its ID in a table, then query the target partition.

......
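The three strategies above can be sketched as small routing functions. This is only an illustration: the partition count, the range boundaries, and the contents of the lookup table are assumed values, and partitions are numbered from 0 here rather than from 1.

```python
from bisect import bisect_right

NUM_PARTITIONS = 4  # assumed partition count, for illustration only


def partition_by_modulo(record_id: int) -> int:
    """ID-feature strategy: the remainder is the partition number."""
    return record_id % NUM_PARTITIONS


# Assumed exclusive upper bounds: IDs below 1,000,000 go to partition 0,
# IDs below 2,000,000 go to partition 1, and so on.
RANGE_BOUNDS = [1_000_000, 2_000_000, 3_000_000]


def partition_by_range(record_id: int) -> int:
    """Range strategy: consecutive blocks of records share a partition."""
    return bisect_right(RANGE_BOUNDS, record_id)


# Lookup-table strategy: an explicit ID-to-partition mapping (made-up data).
LOOKUP_TABLE = {101: 2, 102: 0, 103: 3}


def partition_by_lookup(record_id: int) -> int:
    """Lookup strategy: consult the table first, then go to that partition."""
    return LOOKUP_TABLE[record_id]
```

The modulo version needs no extra storage, while the lookup table costs one extra read per request but lets any record be moved to another partition by updating a single table entry.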

None of these sharding strategies is absolutely better than the others; which one to choose depends entirely on the system's business and data characteristics. It is worth emphasizing that data sharding is not a silver bullet: it brings benefits to the system's performance and scalability, but it also adds a lot of complexity to development. For example, if two records live on separate servers and some business feature establishes an "association" between them, a record of that association will probably have to be stored in each of the two partitions. In addition, if you value data integrity, transactions that span partitions quickly become performance killers. Finally, for business features that need a global view of the data, a sharding strategy offers the system little advantage.
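To make the "association" example concrete, here is a minimal in-memory sketch in which each partition is a plain dictionary and records are routed by ID modulo an assumed partition count. The function names are hypothetical and no real database is involved.

```python
NUM_PARTITIONS = 4  # assumed partition count

# One dict per partition, mapping a record ID to the IDs associated with it.
partitions = [{} for _ in range(NUM_PARTITIONS)]


def partition_of(record_id: int) -> int:
    return record_id % NUM_PARTITIONS


def associate(a: int, b: int) -> None:
    """Store the association once on each side, in each record's partition."""
    partitions[partition_of(a)].setdefault(a, set()).add(b)
    partitions[partition_of(b)].setdefault(b, set()).add(a)


def associated_with(record_id: int) -> set:
    """Answered from a single partition thanks to the duplicated row."""
    return partitions[partition_of(record_id)].get(record_id, set())


associate(1, 6)  # record 1 lives in partition 1, record 6 in partition 2
```

The two writes inside `associate` go to different partitions, so keeping them consistent would need a cross-partition transaction or some repair mechanism, which is exactly the cost described above.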

Data sharding is important, but think twice before you adopt it. Once you board this ship, it is hard to turn back: often you either succeed or go down with it. In my own experience, I was deeply impressed by one effort (and certainly a successful one) to misuse a sharding strategy, so I am now more cautious about sharding.

With that said, let's discuss a fairly common sharding strategy.

Strategy description

First, let me describe an extremely simple business scenario:

The system has users; a user can publish articles; an article can have comments.

Articles can be found by user.

Comments can be found by article.

So how do we shard the data of such a system? We can use the first approach mentioned above: take the ID of a record modulo the number of partitions and select the partition from the result. Based on the last two query requirements above, we add two rules to the partitioning strategy:

All articles of a user are in the same data partition as that user.

All comments on an article are in the same data partition as that article.

You might say that it is enough to guarantee that all articles of the same user are in one partition, right? That is true, but here I also put the articles in the same partition as the user, because that makes many additional operations easier (such as joins in a relational database). So suppose we have 4 data partitions; their contents might then look like this:

Partition 0    Partition 1    Partition 2    Partition 3
User 4         User 1         User 2         User 7
Article 8      Article 5      Article 10     Article 7
Article 12     Article 9      Article 14     Article 11
Comment 4      Comment 13     Comment 6      Comment 3
Comment 16     Comment 17     Comment 10     Comment 15
User 12        User 5         User 10        User 11
Article 4      Article 13     Article 4      Article 4
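For the modulo rule and the two co-location rules to hold together, an article's ID has to leave the same remainder as its author's ID, and a comment's ID the same remainder as its article's ID. The text does not say how such IDs are produced. The sketch below shows one assumed way, a hypothetical allocator that keeps a counter per record kind and partition and hands out IDs of the form k * 4 + partition.

```python
from itertools import count

NUM_PARTITIONS = 4  # the four partitions of the example


def partition_of(record_id: int) -> int:
    return record_id % NUM_PARTITIONS


class ShardAlignedIds:
    """Hypothetical allocator: one counter per (kind, partition) pair."""

    def __init__(self) -> None:
        self._counters = {}

    def next_id(self, kind: str, parent_id: int) -> int:
        """Return a new ID that lands in the same partition as parent_id."""
        partition = partition_of(parent_id)
        counter = self._counters.setdefault((kind, partition), count(1))
        # k * N + partition always leaves remainder `partition` modulo N.
        return next(counter) * NUM_PARTITIONS + partition


ids = ShardAlignedIds()
first_article = ids.next_id("article", parent_id=5)   # author is User 5
second_article = ids.next_id("article", parent_id=5)
comment = ids.next_id("comment", parent_id=second_article)
```

With IDs allocated this way, "find articles by user" and "find comments by article" can each be routed with `partition_of` on the ID already at hand and answered by a single partition.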
