As your application's business grows, the amount of data it manages grows with it, and sooner or later you will have to scale the system. One common approach is scaling up (vertical scaling): improving the system's performance by moving to better hardware. Another, scaling out (horizontal scaling), achieves the same effect by adding more hardware, such as additional servers. In terms of hardware cost and ultimate system limits, scaling out is generally better than scaling up, so most systems that grow large will adopt the "outward" approach to some degree. Because many systems' bottlenecks lie in data storage, a data architecture technique called "database sharding" (splitting data into fragments) has emerged; this article discusses one typical way of implementing it.
Introduction
Sharding simply means distributing the overall data set across multiple storage devices (each portion is hereafter called a "data partition", or just a "partition") so that the amount of data on each device stays small enough to meet the system's performance requirements. Note that there are many possible sharding strategies; the following, for example, are common:
By ID characteristics: for example, take a record's ID modulo the number of partitions; if the result is n, the record is placed on partition n.
By range: for example, the first million users go to partition 1, the next million to partition 2, and so on.
By lookup table: first look up the ID's partition in a mapping table, then go to that target partition to find the record.
......
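The three strategies above can be sketched in a few lines of Python. This is only an illustration: the partition count, the range size, and the lookup-table entries are assumptions made up for the example, not values from the article.

```python
# Minimal sketches of the three common partition-selection strategies.
# All concrete numbers here are illustrative assumptions.

NUM_PARTITIONS = 4

def partition_by_modulo(record_id: int) -> int:
    """Strategy 1: the partition is the record's ID modulo the partition count."""
    return record_id % NUM_PARTITIONS

def partition_by_range(user_id: int, users_per_partition: int = 1_000_000) -> int:
    """Strategy 2: the first million users land in partition 0 (zero-indexed
    here for simplicity), the next million in partition 1, and so on."""
    return user_id // users_per_partition

# Strategy 3: an explicit lookup table mapping each ID to its partition.
# These entries are hypothetical.
lookup_table = {101: 2, 102: 0, 103: 3}

def partition_by_lookup(record_id: int) -> int:
    """Strategy 3: consult the mapping table, then query that partition."""
    return lookup_table[record_id]
```

The lookup-table variant is the most flexible (records can be moved individually) but adds an extra indirection on every query, which is why the table itself is usually cached aggressively.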
None of these sharding strategies is absolutely superior; which one to choose depends entirely on the system's business or data characteristics. It is worth emphasizing that sharding is no silver bullet: it brings benefits in performance and scalability, but it also adds considerable complexity to system development. For example, if two related records sit on separate servers and some business operation must establish an "association" between them, you may well have to place a copy of the association record in each of the two partitions. Moreover, if you care about data integrity, transactions that span partitions immediately become performance killers. Finally, for operations that need a global view of the data, a sharding strategy offers the system little advantage.
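The duplicated-association problem can be made concrete with a small sketch. This is one possible way to handle it, under assumptions of my own (modulo routing, in-memory lists standing in for separate servers); the names are illustrative.

```python
# Hedged sketch: when two associated records live in different partitions,
# one option is to write a copy of the association into each partition so
# that either side can find it locally. Everything here is illustrative.

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]  # stand-ins for separate servers

def partition_of(record_id: int) -> int:
    """Modulo routing, as in the first strategy above."""
    return record_id % NUM_PARTITIONS

def associate(id_a: int, id_b: int) -> None:
    """Store the association record on both sides' partitions."""
    pa, pb = partition_of(id_a), partition_of(id_b)
    partitions[pa].append(("assoc", id_a, id_b))
    if pb != pa:  # duplicate only when the two records are on different partitions
        partitions[pb].append(("assoc", id_a, id_b))

associate(5, 10)  # records 5 and 10 route to partitions 1 and 2
```

The cost of this duplication is exactly the complexity the article warns about: every update to the association must now touch two partitions, and keeping the copies consistent without cross-partition transactions is the hard part.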
Sharding is important, but think twice before committing to it. Once you board this pirate ship, it is succeed or perish, with little chance of turning back. In my own experience, I was deeply impressed by one attempt that misused a sharding strategy (and, of course, did not succeed), which is why I am now much more cautious about sharding.
So now, let's discuss one of the more common sharding strategies.
Strategy description
Let me first describe an extremely simple business scenario:
The system has users; users can publish articles; articles can receive comments.
Articles can be found by user.
Comments can be found by article.
So how would we shard such a system? We can use the first approach mentioned above: take the record's ID modulo the number of partitions, and select the data partition from the result. Based on the two query requirements just described, we add the following rules to the partitioning strategy:
All of a user's articles are in the same data partition as that user.
All of an article's comments are in the same data partition as that article.
You might say that it seems sufficient to guarantee merely that "one user's articles all live in the same data partition", right? True, but I put the articles in the same partition as their user to make many additional operations convenient as well (such as joins in a relational database). Now suppose we have 4 data partitions; their contents might look like this:
Partition 0: User 4, Article 8, Article 12, Comment 4, Comment 16, User 12, Article 4
Partition 1: User 1, Article 5, Article 9, Comment 13, Comment 17, User 5, Article 13
Partition 2: User 2, Article 10, Article 14, Comment 6, Comment 10, User 10, Article 4
Partition 3: User 7, Article 7, Article 11, Comment 3, Comment 15, User 11, Article 4
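The routing rules behind this layout can be sketched as follows. The function names are my own invention; the only logic taken from the article is that a user's partition is the user ID modulo 4, and that articles and comments follow their owners.

```python
# Sketch of the partition-routing rules: users route by ID modulo the
# partition count, and articles/comments are co-located with their owner.
# Function names are illustrative, not from the article.

NUM_PARTITIONS = 4

def user_partition(user_id: int) -> int:
    """A user's partition is the user ID modulo the partition count."""
    return user_id % NUM_PARTITIONS

def article_partition(author_id: int) -> int:
    """Rule 1: an article lives in the same partition as its author."""
    return user_partition(author_id)

def comment_partition(article_author_id: int) -> int:
    """Rule 2: a comment lives with its article, and hence with the
    article's author, so it routes by the author's ID as well."""
    return article_partition(article_author_id)
```

Note that with these rules only the user's ID decides placement: to route an article or comment you must first know which user it ultimately belongs to, which is why queries such as "find articles by user" stay within a single partition.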