How can MySQL cope with dozens of terabytes of data growth in six months?



From: the TingYun blog

Some time ago, Oracle officially released the GA version of MySQL 5.7. The new version implements true parallel replication (based on group commit) instead of schema-based parallel replication, which greatly alleviates the high master-slave replication latency seen in certain workloads. As MySQL matures, more and more users choose it to store their data, and many of them store very large volumes in it.

Over the past six months, TingYun's business volume has grown explosively. The backend data volume has increased from several TB in the first quarter of last year to dozens of TB, and business volume has grown dozens of times over. A major problem for the backend applications and databases is frequent resizing to cope with the increase in front-end traffic. At the database layer, we store business data in MySQL in a distributed fashion, and the cluster architecture is relatively simple: we use the open-source middleware Amoeba for data sharding and read/write splitting. Behind Amoeba sit hundreds of database node groups, each containing a master-slave pair; the master instance receives write requests, and the slave serves read queries.
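The routing Amoeba performs can be sketched as follows. This is a minimal illustration, not Amoeba's actual code: the group count, host names, and the `mobile_app_id` modulus rule are assumptions based on the sharding key used later in this article.

```python
# Hypothetical sketch of Amoeba-style routing: pick a node group by
# mod(mobile_app_id, N), then send writes to that group's master and
# reads to its slave. All names are illustrative.

NODE_GROUPS = [
    {"master": f"shared{i}-master", "slave": f"shared{i}-slave"}
    for i in range(4)  # the real deployment has hundreds of groups
]

def route(mobile_app_id: int, is_write: bool) -> str:
    group = NODE_GROUPS[mobile_app_id % len(NODE_GROUPS)]
    return group["master"] if is_write else group["slave"]
```

With four groups, a write for `mobile_app_id = 6` lands on the master of group 2, while a report query for the same app reads from that group's slave.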

The right way to split

With more open-source middleware available, many users whose data volume is not actually large consider horizontal sharding far too early. In practice, premature horizontal sharding is rarely worthwhile, for two reasons. First, horizontal sharding disrupts live services, and if the system was not designed for sharding from the start, the impact is magnified: for example, when the business relies heavily on multi-table joins or requires strong transactional consistency, horizontal sharding becomes very painful. Second, sharding too early makes a later vertical split far more expensive. Taking the TingYun App as an example: we had split the business database into eight horizontal shards when we found the data growing fast, so we decided to split vertically and move the hourly and daily data to new instances. That meant deploying eight new node groups and migrating the hourly and daily data off all eight existing shards at the same time, which was a considerable workload. Had we sharded 64 ways horizontally, a subsequent vertical split would have been exhausting.

Therefore, the more reasonable approach is to split the business data vertically first: split the original database into multiple databases by business unit. With multiple data sources configured in the application, or with middleware routing to the split databases, the application barely changes, yet the capacity and performance of the backend storage increase several-fold. If a bottleneck appears later, then consider horizontal sharding.
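The "multiple data sources" idea amounts to a static table-to-database mapping. Here is a minimal sketch; the table names, hosts, and database names are all hypothetical, chosen to match the hourly/daily split described above.

```python
# Hypothetical sketch of vertical splitting: each business unit gets its own
# database, and the application picks a data source by table name.
# All names below are illustrative assumptions.

DATA_SOURCES = {
    "minute_stats": "mysql://db-realtime:3306/realtime",
    "hour_stats":   "mysql://db-rollup:3306/rollup",   # hourly data moved out
    "day_stats":    "mysql://db-rollup:3306/rollup",   # daily data moved out
    "app_meta":     "mysql://db-meta:3306/meta",
}

def data_source_for(table: str) -> str:
    return DATA_SOURCES[table]
```

Because the mapping is keyed by table, moving a business unit to a new instance is a one-line config change rather than an application change.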

Elegantly scaling from n to 2n

The biggest headache in horizontal scaling is data migration. Suppose we need to move the data with mod(mobile_app_id, 4) = 2 off shared0. The naive approach: create two new node groups, shared0_new and shared2; restore a full backup of shared0 to both; then delete the mod(mobile_app_id, 4) = 2 rows on shared0_new and the mod(mobile_app_id, 4) = 0 rows on shared2. After the deletes complete, set up replication from shared0 to shared0_new and shared2 to catch up the increments produced during the deletion. Once they are in sync, switch the Amoeba routing rules and retire shared0. This method has many problems. It consumes a lot of time; the storage space is not released after the deletes, and OPTIMIZE TABLE is itself a long-running process. Deleting from a large table also creates a huge transaction that allocates a large undo segment in the system tablespace, and after the transaction commits that undo space is never released back to the OS, only reused by later transactions, which wastes a great deal of storage.

Then we found a much more convenient way: use mysqldump's --where option with the condition mod(mobile_app_id, 4) = 2 when backing up, so that only the rows with remainder 2 are exported, and restore that logical backup to shared2. This is both efficient and elegant.
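The dump command can be assembled like this. The database name is a placeholder, and --single-transaction is my own addition (a commonly used mysqldump flag for a consistent InnoDB snapshot), not something the article specifies.

```python
import shlex

# Sketch of the mysqldump-based shard split: --where restricts the dump to
# rows whose mobile_app_id has the desired remainder, so only the data
# destined for the new shard is exported. The db name is illustrative.

def dump_command(db: str, remainder: int, modulus: int = 4) -> str:
    where = f"mod(mobile_app_id, {modulus}) = {remainder}"
    return " ".join([
        "mysqldump",
        "--single-transaction",          # consistent snapshot, no table locks
        f"--where={shlex.quote(where)}",
        db,
    ])
```

Piping the resulting dump into `mysql` on shared2 completes the migration without any mass DELETE or OPTIMIZE TABLE step.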

Data skew

An unavoidable problem with distributed MySQL storage is data skew. After the business runs for a while, you find that a few shards grow extremely fast, because some users on those shards have very large data volumes. Our current remedy is to migrate such shards to 1 TB storage, but that is not a long-term solution. So we are trying something new: we extended Amoeba so that the data for a single mobile_app_id can be routed to a dedicated backend node group — that is, one shard stores only that one user's data — stored with the TokuDB storage engine. TokuDB compresses data very effectively, retains the basic features of InnoDB apart from a slight loss in query performance, and performs online table-structure changes several times faster than InnoDB. These tests are nearly complete and will soon go into production.
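The extended routing can be sketched as an override table layered over the normal mod rule. This is my own illustration of the idea, not Amoeba's implementation; the tenant id and shard names are invented.

```python
# Sketch of the extended-Amoeba idea: an override table routes an oversized
# mobile_app_id to its own dedicated (TokuDB-backed) node group, while every
# other tenant keeps the normal mod-based routing. Names are illustrative.

OVERRIDES = {
    90210: "shared_big_0",   # a heavy tenant pinned to its own shard
}

def shard_for(mobile_app_id: int, n_groups: int = 4) -> str:
    if mobile_app_id in OVERRIDES:
        return OVERRIDES[mobile_app_id]
    return f"shared{mobile_app_id % n_groups}"
```

The override lookup runs first, so moving a hot tenant out is a routing-table change and a one-time data copy, with no change to the mod rule the other tenants use.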

Distributed joins

The industry still has no perfect solution for distributed joins. Fortunately, TingYun's business avoided multi-table joins from the start of its design: in the business database, each dimension of a report corresponds to one table, so querying a dimension queries a single backend table, and operations are performed table by table. Two distributed-join approaches are currently popular:

1. Global tables. For example, table A joins table B, where B is sharded across many instances. If A is small, a full copy of A can be stored on every shard instance, so the join can run locally and very efficiently. It looks nice, but the restrictions sharply limit the scenarios where it applies.

2. E-R sharding. For example, a user table user(id, name) and an order table order(id, uid, detail) are both sharded by user id, with order.uid referencing user.id. When storing an order, first determine which shard holds the corresponding user, then insert the order record into that same shard. That way, retrieving all of a user's orders touches a single shard and avoids inefficient cross-database joins.
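The co-location rule in point 2 is simply "route the child row by the parent's key." A minimal sketch, with an assumed mod-based shard function and four shards:

```python
# Sketch of E-R sharding: orders are co-located with their user, so an
# order's shard is derived from its uid — the same key that shards the
# user table. Fetching one user's orders then touches exactly one shard.

N_SHARDS = 4

def user_shard(uid: int) -> int:
    return uid % N_SHARDS

def order_shard(order: dict) -> int:
    # route by the referenced user, not by the order's own id
    return user_shard(order["uid"])
```

Because `order_shard` ignores the order's own id, a user's rows in both tables always land together, and the user-orders join stays local.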

Among current open-source middleware, MyCat handles distributed joins the most thoroughly; Alibaba's DRDS also supports distributed join operations.

What is MySQL good at?

Any tool solves problems only within a certain domain; none is universally applicable. The right approach is to let each tool do what it is good at. Relational databases excel at structured queries, not at crunching massive data sets. When we produced the 2015 APP industry average report, we had to pull together the relevant data from all backend shards for further analysis. The data was consolidated into five tables, each with hundreds of millions of records, and we then needed the sums of certain metrics grouped by columns 5 and 6. We first tried processing this in MySQL on an instance with 24 GB of memory; it OOMed repeatedly without ever producing a result. Finally we pulled the data into a Hadoop cluster and aggregated it with the Impala engine: the largest table held nearly 700 million records, and the results came back in about 9 minutes. So resist the "all in one" mindset, and let every component in the system do what it is good at.
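The shape of that aggregation — sums of a metric grouped by two columns — can be sketched as a single streaming pass, where memory grows with the number of groups rather than the number of rows (the property that MySQL's in-memory temporary table lacked at this scale). The column names here are invented for illustration.

```python
from collections import defaultdict

# Sketch of the report aggregation: SUM(value) grouped by two columns,
# done in one streaming pass. Column names are illustrative assumptions.

def group_sum(rows):
    totals = defaultdict(float)
    for row in rows:
        key = (row["industry"], row["metric_type"])  # the two GROUP BY columns
        totals[key] += row["value"]
    return dict(totals)
```

Engines like Impala parallelize exactly this kind of partial aggregation across nodes and then merge the partial sums, which is why it finished in minutes where a single MySQL instance could not.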

O&M in a Distributed MySQL Architecture

Although a distributed MySQL solution solves the storage and performance problems, it brings its own pain points to O&M support.

1. Cross-shard statistics. The middleware cannot query the full backend data set at once. For full-data queries spanning all shards, such as the annual APP industry average report, we can only use automated scripts to extract the data shard by shard and then aggregate it.
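Such a script reduces to "query each shard for a partial result, then merge." A minimal sketch; `fetch_count` stands in for the real per-shard query, and the shard names are illustrative.

```python
# Sketch of the cross-shard statistics script: ask every shard for its
# partial result, then combine. fetch_count is a stand-in for a real
# per-shard MySQL query; shard names are illustrative.

SHARDS = [f"shared{i}" for i in range(4)]

def total_across_shards(fetch_count) -> int:
    return sum(fetch_count(shard) for shard in SHARDS)
```

The same pattern extends to any merge step that is associative — sums, counts, maxima — which is what makes the shard-by-shard extraction workable despite the middleware's limitation.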

2. DDL. Table-structure changes are a frequent need, and most such operations are not supported by the middleware. Trivial against a single database, they become a headache when there are dozens of backend shards. We have no good solution yet: we use automated scripts to batch-execute the statements on every shard, then run a validation script and manually check its output.

Dealing with situations like this inevitably leaves one's hair a little messy, but for now there is no way around it. We need to redesign our scripts to produce friendlier output and build fully automated tooling.

 

Link: http://blog.tingyun.com/web/article/detail/386
