MySQL Sharding Detailed

Source: Internet
Author: User
Tags joins lua mysql version table definition

Turn from (http://blog.sina.com.cn/s/blog_821512b50101hyc1.html) MySQL sharding detailed

A background

We know that when the amount of data in a database gets bigger, the pressure gets bigger, whether it's reading or writing. MySQL replication multi-master multi-slave scheme, load balancing on the upper level, although to some extent to alleviate the pressure. But when the data in a table becomes very large, the pressure is still very large. Imagine, if the amount of data in a table reached tens of millions or even billions of levels, whether it is to build indexes, optimize the cache, etc., will face enormous performance pressure.

Two definitions

Data sharding, also known as data slicing, or partitioning. is to reduce the pressure on a single machine by spreading the data in the same database across multiple databases or multiple machines in a certain condition.

Three categories

Data partitioning can be divided into two categories according to the segmentation rules:

1 , Vertical segmentation

Vertical segmentation of data can also be referred to as vertical segmentation. Think of the database as a large chunk of chunks of "data blocks" (tables), we cut these "chunks" vertically, and then scatter them across multiple database hosts. Such a segmentation method is a vertical (longitudinal) data segmentation. Distribute different tables to different databases or hosts in table units. The rules are simple, easy to implement and suitable for systems with low coupling between business.

Sharding detailed "title=" MySQL sharding detailed "height=" 373 "width=" 553 ">

Advantages of vertical slicing

(1) The splitting of the database is simple and clear;

(2) Application module clear and clear, easy to integrate;

(3) Data maintenance convenient and easy to locate;

Disadvantages of vertical slicing

(1) Some table associations cannot be completed at the database level and need to be completed in the program;

(2) The performance of the table with extremely frequent access and large data volume still exists, not necessarily meet the requirements;

(3) Transaction processing is relatively more complex;

(4) When the segmentation reaches a certain extent, the extensibility will be limited;

(5) Over-reading segmentation can lead to complex system transitions and difficult maintenance.

2 , Horizontal segmentation

In general, simple horizontal slicing is the main way to spread an extremely mundane table into multiple tables, with a subset of the data in each table, according to some rule of a field. In the behavior unit, the data in the same table is split into different databases or hosts in a certain condition. Relatively complex, suitable for single-table huge system.

Sharding detailed "title=" MySQL sharding detailed "height=" 372 "width=" 553 ">

Advantages of horizontal slicing

(1) The table association is basically able to complete the database end;

(2) There will not be some very large data volume and high load of the table encountered a bottleneck problem;

(3) The application side of the overall architecture changes relatively small;

(4) Transaction processing is relatively simple;

(5) As long as the segmentation rules can be defined, it is basically more difficult to meet the limitations of extensibility;

Disadvantages of horizontal slicing

(1) The segmentation rules are relatively more complex, and it is difficult to abstract a segmentation rule that satisfies the whole database.

(2) Later data maintenance difficulty has increased, manual positioning of data is more difficult;

(3) The coupling degree of each module of the application system is high, which may cause some difficulties in the migration and splitting of the later data.

3 , Joint segmentation

In the actual application scenario, except that the load is not too large, the business logic is relatively simple system can solve the extensibility problem by one of the two methods above, I am afraid that most of the other business logic is slightly more complicated, the system load is larger system, Can not be achieved by any of the above data segmentation method to achieve better scalability, and need to use the two methods of segmentation, different scenarios using different segmentation methods.

Sharding detailed "title=" MySQL sharding detailed "height=" 480 "width=" 342 ">

Benefits of Joint Segmentation

(1) Can make full use of vertical segmentation and horizontal segmentation of their respective advantages to avoid their own shortcomings;

(2) to maximize system scalability;

Disadvantages of Joint segmentation

(1) The database system architecture is more complex, the maintenance is more difficult;

(2) The application architecture is also relatively more complex;

Four implementation Scenarios

Now sharding related software implementation In fact a lot, based on database layer, DAO layer, different languages are not lack of cases. Confined to space, here is only a brief introduction.

1 , Mysql Proxy + Hascale

A set of more promising options. MySQL proxy is implemented with Lua script, between the client and server side, play the role of proxy, provide query analysis, failure takeover, query filtering, adjustment and other functions. The current 0.6 version also does not read, write separation. Hscale is for MySQL Proxy plug-ins, but also implemented with Lua, the sharding process has been simplified a lot. It should be noted that MySQL Proxy and hscale each have a certain amount of overhead, but this overhead and centralized data processing method of single query cost is still small.

Mysqlproxy is a database agent tier product that is officially provided by MySQL, and, like MySQLServer, is an open source product based on the GPL open source protocol. can be used to monitor, analyze, or transmit communications between them. His flexibility allows you to use it to the fullest, with features such as connection routing, query analysis, query filtering and modification, load balancing, and basic ha mechanisms.

In fact, the mysqlproxy itself does not have all of these features, but rather provides the basis for implementing the above functions. To implement these features, we also need to write Lua scripts ourselves.

Mysqlproxy actually establishes a connection pool between the client request and the MySQLServer. All client requests are sent to Mysqlproxy and then analyzed by Mysqlproxy to determine whether the read or write operation is distributed to the corresponding MySQLServer. For multi-node slave clusters, load Balancing can also be achieved. Here is the basic architecture diagram for Mysqlproxy:

Sharding detailed "title=" MySQL sharding detailed "height=" 480 "width=" 420 ">

Through the above schematic diagram, we can clearly see mysqlproxy in the actual application of the location, and can do the basic things. About Mysqlproxy more detailed implementation details in the MySQL official document has a very detailed introduction and examples, interested readers can be directly from the official MySQL Web site download or online reading, I am not tired of wasting paper.

Http://forge.mysql.com/wiki/MySQL_Proxy

2 , Hibernate shards

This is a contribution from the Google Technology team (http://www.hibernate.org/414.html), which was born in the sharding process of data on Google's financial system. Because it is implemented at the framework level, it has its own unique features: The standard Hibernate programming model, which can be done with hibernate, has low technical cost, a relatively resilient sharding strategy, and supports virtual Shard.

3 , Spock Proxy

This is also an open source project that is generated in actual demand, based on the MySQL proxy extension. Spock (http://www.spock.com/) is a Web 2.0 site for people looking for. The Spock Proxy (http://spockproxy.sourceforge.net/) project is produced by effectively sharding its own single DB, and Spock Proxy is a branch of MySQL proxy that provides a fan-based The sharding mechanism of the enclosure. Spock is based on rails, so Spock Proxy is also built on rails, and friends who are concerned about ROR should not miss the project.

http://spockproxy.sourceforge.net/

4 , Amoeba for MySQL

Amoeba is a Java-based open source framework that focuses on solving distributed database data source integration proxy programs, based on the GPL3 Open source protocol. At present, amoeba already has query routing, query filtering, read/write separation, load balancing and HA mechanism and other related content.

Amoeba mainly addresses the following issues:

(1) Complex data source integration after data segmentation;

(2) Provide data segmentation rules and reduce the impact of data segmentation rules on the database;

(3) Reduce the number of connections between the database and the client;

(4) Read and write separate routes.

We can see that what amoeba is doing is exactly what we need to improve the scalability of the database through data segmentation.

Amoeba is not an agent layer of proxy program, but a development of the database Agent layer Proxy Program development framework, currently based on amoeba developed a proxy program has amoebaformysql and Amoebaforaladin two.

Amoebaformysql is primarily a solution for MySQL databases, and the protocol requested by front-end applications and the data source database for back-end connections must be MySQL. There is no difference between amoebaformysql and a MySQL database for any client application, and any client requests that use the MySQL protocol can be amoebaformysql parsed and processed accordingly. Below you can tell us about the architecture of Amoebaformysql (from the Amoeba Developer blog):

Sharding detailed "title=" MySQL sharding detailed "height=" 480 "width=" 384 ">

Amoebaforaladin is a more extensive, more powerful proxy program. He can connect data sources from different databases at the same time to serve front-end applications, but only accept client application requests that meet the MySQL protocol. That is, as long as the front-end application is connected through the MySQL protocol, Amoebaforaladin automatically parses the query statement and automatically identifies which physical host of the type database the data source of the query is on, based on the data requested in the query statement. Shows the architectural details of Amoebaforaladin (from the Amoeba Developer blog):

Sharding detailed "title=" MySQL sharding detailed "height=" 480 "width=" 384 ">

At first glance, the two seem to be exactly the same. After a closer look, you will find that the main difference is only after processing through mysqlprotocaladapter, according to the results of the analysis of the data source database, and then select a specific JDBC driver and the corresponding protocol to connect to the backend database.

In fact, through the above two frame composition you may have discovered the characteristics of the amoeba, he is just a development framework, we in addition to the choice he has provided formysql and Foraladin these two products, but also based on their own needs for the corresponding two development, To get more adapted to our own application characteristics of the proxy program.

When it comes to using MySQL database, both Amoebaformysql and Amoebaforaladin can be used very well. Of course, given the complexity of any one system, its performance will certainly have a certain loss, maintenance costs will naturally be relatively higher. So, I recommend using Amoebaformysql for just the MySQL database.

Amoebaformysql is very simple to use, all the configuration files are standard XML files, a total of four configuration files. The following were:

(1) Amoeba.xml: Master configuration file, configure all data sources and amoeba parameter settings;

(2) Rule.xml: Configure all the query routing rules information;

(3) Functionmap.xml: Configures the Java implementation class that is used to parse the function in query;

(4) Rullfunctionmap.xml: The implementation class that configures the specific functions that need to be used in the routing rules;

If your rules are not too complex, you basically need to use only the first two of the four configuration files above to do all the work. Proxy programs commonly used functions such as read-write separation, load balancing and other configurations are carried out in the amoeba.xml. In addition, amoeba has supported automatic routing for vertical and horizontal segmentation of data, and routing rules can be set in Rule.xml.

At present, the main lack of amoeba is its online management function and support for the business, has been in the communication process with the relevant developers have made relevant recommendations, I hope to provide an online maintenance management of the command-line management tools, easy to use online maintenance, The feedback received is that management-specific management modules have been incorporated into the development agenda. In addition, the transaction support is temporarily or amoeba cannot do, even if the client application in the request submitted to amoeba contains transaction information, Amoeba will also ignore the transaction-related information. Of course, after continuous improvement, I believe that business support is definitely the amoeba of the feature.

More detailed usage of Amoeba reader friends can be obtained from the user manual provided on the Amoeba Developer blog (http://amoeba.sf.net), which is not described here.

Case (http://pengranxiang.iteye.com/blog/1145342)

Operations Documentation (HTTP://DOCS.HEXNOVA.COM/AMOEBA/CHAP-GETTING-STARTED.HTML)

5 , Hivedb

Like the previous mysqlproxy and Amoeba, Hivedb is also an open source framework for data segmentation and consolidation based on Java for MySQL databases, but the current hivedb only supports horizontal segmentation of data. It mainly solves the expansibility of database and high-performance access problem of data, and supports the redundancy of data and the basic HA mechanism.

Hivedb implementation mechanism and mysqlproxy and amoeba have a certain difference, he does not use the replication function of MySQL to achieve data redundancy, but to implement the data redundancy mechanism, The bottom layer is mainly based on the hibernateshards to achieve the data segmentation work.

In Hivedb, data is dispersed across multiple mysqlserver through a variety of user-defined Partitionkeys, in fact, data segmentation rules. At the time of access, when the query request is run, the filter condition is automatically parsed, the data is read in parallel from multiple mysqlserver, and the result set is merged back to the client application.

Purely from the functional aspect, Hivedb may not be as powerful as mysqlproxy and amoeba, but the idea of data segmentation has no essential difference from the previous two. In addition, Hivedb is not just a content shared by open source enthusiasts, but an open source project supported by commercial companies.

The following is an image of the official website of Hivedb, which describes how Hivedb organizes the basic information of the data, although it can not show too much information about the architecture in detail, but also basically shows its unique aspect in data segmentation.

Sharding detailed "title=" MySQL sharding detailed "height=" 471 "width=" 553 ">

http://www.hivedb.org/

6 , Datafabric

Application-level sharding

Master/slave replication

Https://github.com/bpot/data_fabric

7 , Pl/proxy

The first few are for the MySQL sharding scheme, Pl/proxy is for PostgreSQL, design ideas like Teradata Hash mechanism, the data store is transparent to the client, the customer request sent to Pl/proxy, by the distributed storage Unified distribution of the process calls. Pl/proxy is designed to act as a "data bus" in this layer, so when the throughput is not supported, you only need to add more Pl/proxy servers. The famous Skype for use is the Pl/proxy solution.

8 , Pyshards

This is a Python-based solution. The goal of the tool is to have a re-balancing in it, which is a more radical idea. Only MySQL database is currently supported.

Http://code.google.com/p/pyshards/wiki/Pyshards

9 , and other solutions for data segmentation and integration

In addition to the several data segmentation and integration solutions described above, there are many other solutions that also provide data segmentation and integration. such as Mysqlproxy based on the further expansion of the Hscale, through rails built spockproxy, as well as Pathon-based pyshards and so on.

Regardless of which solution you choose to use, the overall design approach should basically not have any change, that is, through the vertical and horizontal segmentation of data, enhance the overall service capabilities of the database, so that the overall expansion of the application system as far as possible, the expansion of the way as easy as possible.

As long as we through the middle-tier proxy application better solve the problem of data segmentation and data source integration, then the linear scalability of the database will be easy to do as our application, just need to add a cheap pcserver server, can linearly increase the overall service capacity of the database cluster, Make the database no longer an easy performance bottleneck for application systems.

Five things to note

The following is what we call partitioning, mainly refers to horizontal partitioning.

1, before implementing the partition, we can check whether the installed version of MySQL support partition:

Mysql> Show variables like "%partition%";

If supported, it will show:

+-------------------+-------+

| variable_name | Value |

+-------------------+-------+

| have_partitioning | YES |

+-------------------+-------+

2. The partition applies to all data and indexes of a table, cannot partition data only, not partition the index, and vice versa, and cannot partition only part of the table.

3. Type of partition

(1) RANGE partition: Assigns multiple rows to a partition based on column values that belong to a given contiguous interval.

(2) List partition: Similar to by range partitioning, the difference is that a list partition is selected based on a value in a set of discrete values that match a column value.

(3) hash partition: A partition that is selected based on the return value of a user-defined expression that is calculated using the column values of those rows that will be inserted into the table (the scheme used by Sina Weibo).

(4) Key partition: Similar to partition by hash, the difference is that the KEY partition only supports the calculation of one or more columns, and the MySQL server provides its own hash function. You must have one or more columns that contain integer values.

Partitions are always numbered automatically when they are created, regardless of the type of partition used, and are recorded starting at 0. When a new row is inserted into a partitioned table, these partition numbers are used to identify the correct partition.

4, MySQL provides many ways to modify the partition table. It is possible to add, delete, redefine, merge, or split partitions that already exist. All of these operations can be implemented by using the partition extension of the Alter TABLE command.

5. You can partition an existing table and use the ALTER TABLE command directly.

Six possible problems of data segmentation and integration

Here, we should have a certain understanding of the implementation of data segmentation and integration, perhaps many readers have been based on various solutions to the merits of the respective characteristics of the basic selection of suitable for their own application scenarios, the work behind the main is the implementation of the preparation.

There are some possible problems that we need to do before we can implement a data-splitting scheme. In general, the problems we may encounter are mainly as follows:

1, the introduction of distributed transaction issues

Once the data is sliced and stored in multiple mysqlserver, no matter how perfect our segmentation rules are (in fact, there is no perfect segmentation rule), it is possible that the data involved in some of the previous transactions is no longer in the same mysqlserver.

In such a scenario, if our application is still in accordance with the old solution, then it is bound to need to introduce a distributed transaction to solve. In each version of MySQL, the distributed transaction is supported only by the versions since MySQL5.0 began, and only INNODB provides distributed transaction support. Not only that, even if we are using a MySQL version that supports distributed transactions, but also a InnoDB storage engine, the distributed transaction itself consumes a lot of system resources, and the performance itself is not too high. And the introduction of distributed transaction itself in the exception processing will bring more difficult to control factors.

What to do? In fact, we can use a workaround to solve this problem, the first thing to consider is: whether the database is the only place to solve the transaction? This is not the case, we can completely combine the database and the application both to solve together. Each database resolves its own transactions, and then uses the application to control transactions on multiple databases.

That is, as long as we are willing, a distributed transaction spanning multiple databases can be split into multiple small transactions that are only on a single database, and the application is used to control small transactions. Of course, the requirement for this is that our Russian application must be robust enough to provide some technical difficulties for the application as well.

2. Cross-node Join issues

The above describes the possible introduction of distributed transactions, and now let's look at the problem of cross-node join. Data segmentation may cause some old join statements to be unusable because the data source used by join may be sliced into multiple mysqlserver.

What to do? This problem from the MySQL database point of view, if it has to be directly resolved on the database side, I am afraid only through the MySQL a special storage engine federated to solve. The Federated storage Engine is a solution for MySQL to address issues such as Oracle-like Dblink. The main difference with Oracledblink is that federated will save a copy of the definition of the remote table structure locally. At first glance, Federated is really a great solution for cross-node joins. But we should also be clear, it seems that if the remote table structure is changed, the local table definition information will not follow the corresponding changes. If you update the remote table structure without updating the local federated table definition information, it is likely to cause query to run out of error and not get the correct results.

To deal with this kind of problem, I still recommend using the application to process, first in the driver table in the MySQLServer to take out the corresponding drive result set, and then based on the driving result set to the driver table is located in the MySQLServer to fetch the corresponding data. Many readers may think that this will have a certain impact on performance, yes, it does have a certain negative impact on performance, but in addition to this method, there are basically not many other better solutions. Moreover, because the database through the good expansion, each mysqlserver load can get better control, purely for a single query, its response time may be higher than the non-segmentation, so the performance of the negative impact is not too large. What's more, similar to the need for cross-node joins is not too much, and probably only a small part of the overall performance. So for the overall performance of the consideration, the occasional sacrifice a little bit, in fact, is worth it, after all, the system optimization itself is a lot of trade-offs and balance process.

3, cross-node merge sorting paging problem

Once the data has been sliced horizontally, it is possible that not only the cross-node join does not work, but the data source for some sort of paged query statement may also be sliced to multiple nodes, and the immediate consequence is that these sort paged queries cannot continue to function properly. In fact, this and cross-node join is a reason, the data source exists on multiple nodes, to be resolved through a query, and cross-node join is the same operation. The same federated can be partially solved, of course, the same risks exist.

Or the same problem, how to do? I also continue to recommend that the application be addressed.

How to solve? The solution is broadly similar to that of a cross-node join, but there is a point that joins the cross-node join, which often has a driver-driven relationship, so there is generally a sequential relationship between the data reads of multiple tables involved in the join itself. But sort of paging is not the same, the sort of paging data source can basically be said to be a table (or a result set), itself does not exist a sequential relationship, so the process of fetching data from multiple data sources can be completely parallel. In this way, the fetch efficiency of the sorted paging data can be higher than the cross-Library join, so the performance loss is relatively smaller and in some cases more efficient than the original data-splitting database. Of course, whether it's cross-node join or sorting across nodes, our application server consumes more resources, especially memory resources, because the process of reading access and merging result sets requires more data than was originally processed.

Analysis here, perhaps a lot of reader friends will find that all of these issues above, I give the advice is basically through the application to solve. People may start to whispered, is it because I am a DBA, so many things are thrown to application architects and developers?

In fact, this is not the case, the first application because of its particularity, it can be very easy to do very good extensibility, but the database is different, it must be used in many other ways to expand, and in this expansion process, It is difficult to avoid situations where some of the original problems can be solved in a centralized database, but split into a single DB cluster. In order to maximize the overall system, we can only let the application do more things to solve the problem that the database cluster can not be better solved.

Seven MySQL examples of horizontal and vertical shards based on amoeba

For detailed configuration description, please refer to: http://pengranxiang.iteye.com/blog/1145342

MySQL Sharding Detailed

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.