Concepts and Principles of MySQL Data Splitting

Last Update:2013-12-16 Source: Internet

Author: User

Tags database sharding


Data splitting may not be familiar to many of us, but it is a very important technique for MySQL databases. This article introduces the concepts behind data splitting in MySQL in detail.

What is data splitting?

The English word "shard" means "fragment". As a database-related technical term, it seems to have first appeared in massively multiplayer online role-playing games. "Sharding" can be thought of as fragmenting the data. Sharding is not a new technology but a relatively simple software concept. As is well known, table partitioning only became available in MySQL 5 (specifically, in version 5.1). Before that, many potential MySQL users had concerns about MySQL's scalability, and whether a database offers partitioning is a key indicator (though of course not the only one) of its scalability.

Database scalability is a perennial topic. MySQL advocates are often asked: how can application data that lives in a single database be partitioned? The answer is sharding. Sharding is not a feature attached to a specific database product but an abstraction above the specific technical details. It is a horizontal scaling (scale-out) solution whose main purpose is to break through the I/O limits of a single database server and solve the database scalability problem.

Data is distributed horizontally across different databases or tables according to a set of sharding rules. To run a query, the specific database or table is located through the corresponding database or table routing rules. The "sharding" discussed here usually means "horizontal splitting", which is also the focus of this article. What do the splitting and routing methods actually look like? A simple example helps: take the articles in a blog application. Suppose the article table has the following fields: article_id (int), title (varchar(128)), content (varchar(1024)), user_id (int).

How can such a table be split, and how can its data be distributed across tables in different databases? Analyzing the blog application shows that it has two kinds of users: visitors and blog owners. A visitor browses posts under a particular user's blog, while an owner manages his or her own blog, which also means operating within the space of a particular user's blog. That "particular user" is represented in the database by the user_id field, and user_id is the basis for the sharding rule we need. For example, we can put all articles with user_id 1 to 10000 into the article table in DB1, all articles with user_id 10001 to 20000 into the article table in DB2, and so on up to DBn.
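The range rule in this example can be sketched in a few lines of Python. This is only an illustration: the constant USERS_PER_DB and the function names are hypothetical, and the width of 10000 users per database is simply the figure used in the example.

```python
# Range-based sharding rule from the example:
# user_id 1..10000 -> DB1, 10001..20000 -> DB2, and so on.
USERS_PER_DB = 10_000  # assumed range width, taken from the example


def shard_index(user_id: int) -> int:
    """Return the 1-based number of the database that holds this user."""
    if user_id < 1:
        raise ValueError("user_id must be a positive integer")
    return (user_id - 1) // USERS_PER_DB + 1


def shard_name(user_id: int) -> str:
    """Return the database name (DB1, DB2, ...) for this user."""
    return f"DB{shard_index(user_id)}"


if __name__ == "__main__":
    for uid in (1, 10_000, 10_001, 20_000, 20_001):
        print(uid, "->", shard_name(uid))
```

A fixed range width keeps the rule easy to reason about, but it may distribute load unevenly if some ranges of users are much more active than others.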

In this way the article data is naturally divided among the databases, which achieves the goal of data splitting. The next problem is how to find the right database. The answer is equally simple: since user_id was the distinguishing field when sharding, the routing process also depends on user_id. Consider the blog application again. Whether someone is visiting another user's blog or managing their own, we always know whose blog it is, that is, we know the blog's user_id. We then apply the sharding rule to that user_id to locate the specific database. For example, user_id 234 maps to DB1 under the rule above, and user_id 12343 maps to DB2. Applying the sharding rule in reverse to find the specific database in this way is called "DB routing".
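As a rough sketch of DB routing, the following Python example routes each operation to a database chosen from user_id. It makes several assumptions: in-memory SQLite databases stand in for separate MySQL servers so that the example is self-contained, and the ShardRouter class and its method names are hypothetical.

```python
import sqlite3

USERS_PER_DB = 10_000  # assumed range width from the example rule


class ShardRouter:
    """Routes article operations to DB1..DBn based on user_id."""

    def __init__(self, shard_count: int) -> None:
        self.shards = {}
        for n in range(1, shard_count + 1):
            # Stand-in for a connection to a separate MySQL server.
            conn = sqlite3.connect(":memory:")
            conn.execute(
                "CREATE TABLE article ("
                "article_id INTEGER PRIMARY KEY, title VARCHAR(128), "
                "content VARCHAR(1024), user_id INTEGER)"
            )
            self.shards[f"DB{n}"] = conn

    def db_name(self, user_id: int) -> str:
        """Apply the sharding rule to get the database name."""
        return f"DB{(user_id - 1) // USERS_PER_DB + 1}"

    def route(self, user_id: int) -> sqlite3.Connection:
        """DB routing: return the connection that owns this user_id."""
        name = self.db_name(user_id)
        if name not in self.shards:
            raise LookupError(f"no shard configured for user_id {user_id}")
        return self.shards[name]

    def add_article(self, article_id, title, content, user_id) -> None:
        conn = self.route(user_id)
        conn.execute(
            "INSERT INTO article VALUES (?, ?, ?, ?)",
            (article_id, title, content, user_id),
        )
        conn.commit()

    def titles_for_user(self, user_id: int) -> list:
        cursor = self.route(user_id).execute(
            "SELECT title FROM article WHERE user_id = ? ORDER BY article_id",
            (user_id,),
        )
        return [row[0] for row in cursor]


if __name__ == "__main__":
    router = ShardRouter(shard_count=2)
    router.add_article(1, "Hello", "First post", 234)      # stored in DB1
    router.add_article(2, "Notes", "Another post", 12343)  # stored in DB2
    print(router.db_name(234), router.titles_for_user(234))
    print(router.db_name(12343), router.titles_for_user(12343))
```

Every query in this sketch carries a user_id, which is what makes routing possible. Queries that span many users would have to visit several databases, a case this example does not cover.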

Of course, a database design that takes data splitting into account is not a conventional, orthodox one. So what does an orthodox database design look like?

It is basically what we normally do. We usually design our databases deliberately according to the normal forms. Under high load we may use a replication mechanism to improve read and write throughput and performance, which may already satisfy many requirements (its drawbacks are discussed below). A design that takes data splitting into account violates this rule of designing deliberately according to the normal forms. In order to split, we need a redundant field in the table that serves as the distinguishing or marker field for sharding, such as user_id in the article example above (admittedly, that example does not show the redundancy well, because user_id would be in the table even without sharding, so there we get the field for free). Redundant fields do not arise only in sharding scenarios; many large applications need redundancy as well. That is a matter of efficient database design and is not covered further here.

Why data splitting?

The previous section briefly described what data splitting is. Readers may wonder why it is needed at all. Isn't a mature and stable database such as Oracle enough to store and query massive amounts of data? Oracle is indeed mature and stable, but its high usage costs and its need for high-end hardware are more than many companies can afford. Imagine annual usage fees in the tens of millions plus expensive minicomputers as the supporting hardware: can an ordinary company pay for that? And even if it can, why not choose a solution that is better, cheaper, and scales out more easily, if one exists?

However, things rarely go as we wish. We usually design our databases deliberately according to the normal forms, and under high load we may use a replication mechanism to improve read and write throughput and performance. This may already satisfy many requirements, but the mechanism has obvious drawbacks. First, its effectiveness depends on the proportion of read operations. The master often becomes the bottleneck, because write operations have to be queued and executed in order. When the master is overloaded it is the first to fail to keep up, and the slaves' synchronization lag can also grow large. Replication also consumes a great deal of CPU, because every write executed on the master has to be run once more on each slave. At that point replication offers little real benefit.
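The argument about replication can be illustrated with a toy model that only counts operations per node. It is a sketch under simple assumptions: reads are spread round-robin over the slaves, every write is applied on the master and then replayed on every slave, and the ReplicatedCluster class is hypothetical rather than a real driver or MySQL API.

```python
from itertools import cycle


class ReplicatedCluster:
    """Counts how many operations each node handles under replication."""

    def __init__(self, slave_count: int) -> None:
        if slave_count < 1:
            raise ValueError("at least one slave is required")
        self.master_writes = 0
        self.slave_writes = [0] * slave_count
        self.slave_reads = [0] * slave_count
        self._next_slave = cycle(range(slave_count))

    def write(self) -> None:
        # The write runs on the master first ...
        self.master_writes += 1
        # ... and is then replayed once on every slave.
        for i in range(len(self.slave_writes)):
            self.slave_writes[i] += 1

    def read(self) -> int:
        # Reads are balanced across slaves (round-robin).
        i = next(self._next_slave)
        self.slave_reads[i] += 1
        return i


if __name__ == "__main__":
    cluster = ReplicatedCluster(slave_count=3)
    for _ in range(90):
        cluster.read()
    for _ in range(10):
        cluster.write()
    print("reads per slave: ", cluster.slave_reads)
    print("writes per slave:", cluster.slave_writes)
    print("writes on master:", cluster.master_writes)
```

In this model, adding slaves lowers the number of reads each node serves, but every node still executes every write, so a write-heavy workload gains little from more replicas.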

If replication cannot cope, why does sharding work? The principle is simple: sharding scales well. Every machine has a physical ceiling no matter how well it is configured, so when an application reaches or far exceeds what a single machine can do, the only options are to enlist other machines or to upgrade the hardware. The common solution is to scale out, adding more machines to share the load. We also have to consider whether, as the business keeps growing, adding machines can meet demand in a roughly linear fashion. Sharding makes it easy to distribute computation, storage, and I/O across multiple machines in parallel. This makes full use of the processing power of all the machines, avoids a single point of failure, improves availability, and isolates faults.

For these reasons data splitting is necessary. The data splitting discussed here takes MySQL as its background, since many companies choose the free and open MySQL for cost reasons. Developers who know MySQL may be aware that table partitioning only became available in MySQL 5. Before that, many potential MySQL users had concerns about MySQL's scalability, and whether a database offers partitioning is a key indicator (though of course not the only one) of its scalability. Database scalability is a perennial topic. MySQL advocates are often asked: how can application data that lives in a single database be partitioned? The answer is sharding, which is what we call the data splitting solution.

How to split data

Now let us look in detail at the methods and forms of data splitting.

Data splitting can be physical: data is distributed to different database servers through a set of splitting rules, and a specific database is reached through routing rules. Each request then faces one of N servers rather than a single server, which reduces the load on any one machine.

Data splitting can also happen inside a single database: data is distributed to different tables in the database through a set of splitting rules. For example, article can be divided into sub-tables such as article_001 and article_002, and the horizontal concatenation of the sub-tables forms the complete logical article table. The reason for doing this is simple. Suppose the article table holds 50 million rows and we insert a new row. After the insert, the database has to update the indexes on the table, and the overhead of maintaining indexes over 50 million rows is not negligible.
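A minimal sketch of table splitting might look like the following. The article does not state which rule maps a row to a sub-table, so the modulo rule on user_id used here is an assumption, as are the function names. The UNION ALL query only illustrates how the sub-tables together form the logical article table.

```python
TABLE_COUNT = 100  # hypothetical number of sub-tables


def sub_table_for(user_id: int, table_count: int = TABLE_COUNT) -> str:
    """Pick a sub-table name such as article_001 for a given user_id.

    Assumed rule: user_id modulo the number of sub-tables.
    """
    if user_id < 0:
        raise ValueError("user_id must not be negative")
    return f"article_{user_id % table_count + 1:03d}"


def logical_article_sql(table_count: int = TABLE_COUNT) -> str:
    """Build a query that concatenates all sub-tables horizontally."""
    parts = [f"SELECT * FROM article_{n:03d}" for n in range(1, table_count + 1)]
    return "\nUNION ALL\n".join(parts)


if __name__ == "__main__":
    print(sub_table_for(234))
    print(logical_article_sql(table_count=3))
```

Reading the whole logical table through such a query touches every sub-table, so in practice most queries would be written against the single sub-table selected by the rule.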

Conversely, if we divide the table into 100 tables, from article_001 to article_100, the 50 million rows average out to only 0.5 million rows per sub-table. When we insert into a table with only 0.5 million rows, the time spent on index maintenance drops by an order of magnitude, which greatly improves the database's responsiveness and its concurrency. The benefits of table splitting do not stop there: there are further obvious gains, for example in the locking involved in write operations.
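The arithmetic behind this example can be checked with a short sketch. It assumes 50 million rows split evenly over 100 sub-tables, and it uses a scaled-down simulation with a modulo rule on a hypothetical row id to show that such a rule spreads rows evenly. The function names are illustrative only.

```python
from collections import Counter

TOTAL_ROWS = 50_000_000  # rows in the unsplit article table (from the example)
TABLE_COUNT = 100        # article_001 .. article_100


def rows_per_table(total_rows: int, table_count: int) -> int:
    """Average number of rows per sub-table under an even split."""
    return total_rows // table_count


def simulate_distribution(row_count: int, table_count: int) -> Counter:
    """Assign row ids 1..row_count to sub-tables by modulo and count them."""
    return Counter(row_id % table_count for row_id in range(1, row_count + 1))


if __name__ == "__main__":
    print(rows_per_table(TOTAL_ROWS, TABLE_COUNT))  # 500000
    # Scaled-down simulation: 50,000 rows over 100 sub-tables.
    counts = simulate_distribution(50_000, TABLE_COUNT)
    print(min(counts.values()), max(counts.values()))
```

The simulation uses 50,000 rows instead of 50 million only to keep it fast; the even split carries over because the rule depends only on the remainder of the row id.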

In summary, splitting across databases reduces the load on any single machine, while splitting across tables improves the efficiency of data operations, especially writes. Up to this point we have still not covered exactly how to split; the splitting rules themselves call for a detailed description of their own.

This concludes the introduction to the concepts and principles of MySQL data splitting.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
