Database Shard Shard Operations

Source: Internet
Author: User
Tags db2

2.1.1 What is data slicing

The word "Shard" means "fragments" in English, and the technical terminology associated with the database seems to be the earliest seen in massively multiplayer online role-playing games. "Sharding" is called "Shard". Sharding is not a new technology, but a relatively simple concept of software. It is well known that after MySQL 5, the data table partitioning feature, before this, many of the potential users of MySQL has a concern for MySQL scalability, and whether the partitioning function is a measure of the scalability of a database is a key indicator (not the only indicator). Database extensibility is an eternal topic, and MySQL advocates often ask: how do you handle processing of application data on a single database that needs to be partitioned and so on?  The answer is: sharding. Sharding is not a feature attached to a particular database software, but an abstraction on top of specific technical details, a solution for horizontal scaling (scale out, or scale-out, and scale-out), with the main purpose of exceeding the I/O capability limits of a single-node database server. Resolve database extensibility issues.

The data is distributed horizontally to different DB or table through a series of sharding rules, and the specific DB or table that needs to be queried is found through the corresponding DB route or table routing rule for query operations. "Sharding" is usually referred to as "horizontal segmentation", which is the focus of this paper. What are the specific ways of slicing and routing? With this in hand, the reader will inevitably have some questions, and here's a simple example: we explain the logs in a blog app, such as the log article (article) table, which has the following fields:


article_id (int), title (varchar), content (varchar), user_id (int)

In the face of such a table, how do we slice it? How do you distribute such data to tables in different databases? In fact, analysis of the application of the blog, we can not be difficult to draw the conclusion: the application of the blog, the user is divided into two categories: the browser and the owner of the blog. Browser browsing a blog, is actually in a specific user's blog under the browsing, and the owner of the blog to manage their own blog, but also in a specific user blog under the operation (in their own space). The so-called specific user, the field representation of the database is "user_id". This is the "user_id", which is the basis for the basis and rules of the sub-Library we need. we can do this by putting user_id all the article information in the 1~10000 into the article table in DB1, user_id all the article information in 10001~20000 into the DB2 table, and so on, until article. in this way, the article data is naturally divided into various databases, to achieve the purpose of data segmentation. The next problem to solve is how to find a specific database? In fact, the problem is simple and obvious, since the Sub-Library when we use the distinction between the field user_id, then it is natural that the process of database routing is certainly not user_id. Consider the blog application we just presented, whether it is to visit other people's blog or to manage their own blog, in short I have to know who the user of this blog, that is, we know the user_id of this blog, the use of this user_id, using the rules of the library, In turn to locate specific databases, such as user_id is 234, the use of the rules, it should be located to DB1, if user_id is 12343, using the rules, it should be positioned to DB2. And so on, using the rules of the sub-Library, the reverse route to the specific DB, this process we call "DB route".

Of course, considering that the DB design of data splitting is necessarily unconventional, unorthodox db design. So what kind of db design is the Orthodox db design?

The basic of our normal use is. Usually we will consciously design our database according to the paradigm, the load point may consider using the relevant replication mechanism to improve the throughput and performance of Read and write, which may already meet a lot of requirements, but the mechanism of its own shortcomings is relatively obvious (mentioned below). The above mentioned "consciously according to the paradigm design". Considering that the DB design of the data segmentation would violate the usual rules and constraints, in order to slice, we had to have redundant fields in the table of the database, used as a field of distinction or as a marker field called a library, such as user_id in the example above article (of course, Just now the example is not very good to reflect the redundancy of user_id, because user_id this field even if it is not divided into the library, but also to appear, we picked up the cheap bar). Of course, the appearance of redundant fields is not only in the scene of the sub-Library, in many large applications, redundancy is also necessary, this involves the design of high-efficiency db, this article will not repeat.

2.1.2 Why data splitting

There is a brief description and explanation of what is data segmentation, and the reader may wonder why it needs data segmentation. A mature and stable database like Oracle is enough to support the storage and querying of massive amounts of data? Why do we need data slicing? Indeed, Oracle's db is really mature and stable, but the high cost of use and high-end hardware support is not a business that every company can afford. Imagine a year of tens of millions of of the cost of use and tens of millions of dollars of minicomputer as a hardware support, which can be paid by the general company? Even if it is possible to pay, if there is a better solution, there is a more inexpensive and horizontal expansion of better performance solutions, why we do not choose?

However, things are always unsatisfactory. Usually we will consciously design our database according to the paradigm, the load point may consider using the relevant replication mechanism to improve the throughput and performance of Read and write, which may already meet a lot of requirements, but the mechanism of its own shortcomings is relatively obvious. First of all, its effectiveness depends on the ratio of read operations, Master will often become the bottleneck, write operations need to queue to execute, overload if master first can not carry, slaves data synchronization delay may be larger, and will greatly consume CPU computing power, Because the write operation is performed on master, it needs to be run on every slave machine. At this time sharding may become chicken ribs. Replication is uncertain, then why sharding can work? The reason is simple, because it can be well extended. We know that every machine, no matter how good it is configured, has its own physical limit, so when we apply something that has reached or goes far beyond a single machine, we can only look for help from another machine or continue to upgrade our hardware, but the common scenario is scale-out, By adding more machines to share the pressure. We also have to consider when our business logic is growing, can our machines meet demand through linear growth? Sharding can easily distribute compute, storage, I/O in parallel to multiple machines, which can take full advantage of the various processing capabilities of multiple machines while avoiding single points of failure, providing system availability and making good fault isolation.

In combination with these factors, data segmentation is necessary, and the data segmentation we are discussing here also takes MySQL as the background. Based on cost considerations, many companies also chose free and open MySQL. Developers who know about MySQL may know that the data table partitioning feature is available after MySQL 5, and before that many potential users of MySQL have concerns about MySQL extensibility, and whether partitioning is a key metric to measure the scalability of a database ( Of course not the only indicator). Database extensibility is an eternal topic, and MySQL advocates often ask: how do you handle processing of application data on a single database that needs to be partitioned and so on? The answer is also sharding, which is what we call a data segmentation scheme.

We use free MySQL and cheap server or even a PC cluster, to achieve the effect of minicomputer + large business db, reduce the amount of capital investment, reduce operating costs, why not? So, we choose Sharding, embrace sharding.

2.1.3 How to Do data segmentation

When it comes to data segmentation, again we talk about the methods and forms of data segmentation in detail and explained.

Data segmentation can be physical, the data through a series of segmentation rules to distribute the data to different DB servers, routing rules to access a specific database, so that each access to face is not a single server, but the n server, which can reduce the load pressure on a single machine.

The Data segmentation can also be within the database, through a series of segmentation rules, the data distributed to a database of different tables, such as the article into article_001,article_002 and other sub-tables, A number of sub-tables are flattened horizontally and have a complete article table that makes up the logic, which is actually very simple to do. For example, for example, the article table now has 5000w data, at this point we need to add (insert) a new data in this table, after the insert is complete, the database will re-index this table, The overhead of indexing 5000w rows of data is still not negligible. But conversely, if we divide the table into 100 tables, from article_001 to article_100,5000w, there is only 500,000 rows of data in each sub-table, At this time, we are going to a table with only 50w rows of data after the Insert data index will be a magnitude decline, greatly improving the efficiency of the DB runtime, increase the concurrency of the DB. Of course, the benefits of the table are unknown, and there are many obvious benefits to be found, such as the lock operation of the write operation.

In conclusion, the sub-Library reduces the load of the single point machine, and the table increases the efficiency of the data operation, especially the write operation. we still don't have a problem with how to slice. Next, we will elaborate and explain the segmentation rules.

As mentioned above, in order to achieve the level of data segmentation, in each table must have the redundant characters as the segmentation basis and the Mark field, the usual application we choose user_id as the field of distinction, based on the following three kinds of library methods and Rules: (Of course, there can be other ways)

By the number of segments:

(1) user_id for the distinction, 1~1000 corresponding db1,1001~2000 corresponding DB2, and so on;

Pros: can be partially migrated

Cons: Uneven data distribution

(2) Hash modulus:

Hash the user_id (or use the value of the user_id if the user_id is numeric), then use a specific number, such as the application needs to cut a database into 4 databases, we use 4 this number to User_ ID of the hash value of the modulo operation, that is user_id%4, so that each operation there are four possible: The result is 1 when the corresponding DB1; the result is 2 when the corresponding DB2; the result is 3 corresponds to DB3; The result is 0 corresponds to DB4, This allows the data to be distributed to 4 DB in a very uniform way.

Advantages: Uniform Data distribution

Cons: Data migration is cumbersome and cannot be apportioned according to machine performance

(3) Save the database configuration in the authentication library

is to set up a DB, this DB separately save user_id to DB mapping, each time you access the database to query the database first to obtain the specific DB information, and then we need to do the query operation.

Advantages: Strong flexibility, one-on-one relationship

Cons: More than one query before each query, performance is greatly compromised

These are the three ways that we choose in the usual development, and some of the more complex projects may be mixed with these three ways. through the above description, we have a simple understanding and understanding of the rules of the sub-Library. Of course, there will be better and more perfect way to divide the library, but also we need to constantly explore and discover

(GO) database Shard Shard operation

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.