Deep interpretation of the horizontal sub-Library & de-IoE under massive data storage and access

Source: Internet
Author: User

With the development of large-scale Internet applications, the storage and access of massive data becomes the bottleneck of system design, and distributed processing becomes the perfect choice. Database splitting, especially the horizontal library, is a difficult job, involving a series of technical decisions.

I was fortunate to be responsible for the 1th shop Order level sub-library design and implementation of the project, here combined with the practice of the level of the sub-Library to do a systematic analysis, I hope for the level of the library (including to the IoE) to provide ideas for transformation, the main content includes:

    • Horizontal Sub-Library description
    • Sub-Library dimension – which field is divided by the library
    • Sub-Library policy – How records are assigned to different libraries
    • Number of libraries – Number of initial libraries and how many libraries are growing
    • Route transparency – How to implement Seobinggo, support application transparency
    • Paging-paging case across multiple libraries how to handle
    • Lookup mapping-non-library fields are mapped to a library field for library access
    • Overall architecture – The overall technical architecture of the library
    • On-Line Steps – Library transformation implementation on-line
    • Project Summary
Horizontal Sub-Library description

There are two types of database splits:

1) Vertical Sub-Library
There are too many tables in the database, take out part of the new library, generally according to the Business Division table, close the table to put the same database, the application of database connection can be modified, relatively simple.
2) Horizontal sub-Library
A table is too large, a single database can not be stored or access performance pressure, a table is split into multiple, each table to store part of the records, stored in different databases, the level of the library needs to be large-scale transformation of the system.

Shop 1th The core order table is stored in the Oracle database, records hundreds of billions, the number of fields, access to the model is also complex, with the rapid growth of the business, regardless of storage space or access performance are facing enormous challenges, especially in the big promotion, the order database has become a system bottleneck.

There are usually two workarounds:

1) scale up, upgrade the physical machine where the Oracle database is located, improve memory/storage/io performance, but this upgrade is expensive and can only meet short-term needs.
2) scale out, split the order library into multiple libraries, dispersed across multiple machines for storage and access, this practice supports horizontal expansion to meet long-term needs.

Shop 1th to take the latter approach, its order base mainly includes the Order Main Table/order schedule (record product details)/order Extension table, the horizontal sub-library is to divide the records of these 3 tables into multiple databases, the order of the horizontal sub-Library effect as shown:

The original Oracle library was replaced by multiple MySQL libraries, support 1 master multi-standby and read-write separation, the master and slave through the MySQL-brought data synchronization mechanism (sla<1 seconds), all applications through the order service access order data.

The dimension of the library

The horizontal sub-library first of all to consider which field as the library dimension, the selection criteria is to avoid the application code and SQL performance is affected, which requires that the current SQL in the library, access as far as possible in a single library, or library access into a multi-library scan, read and write performance and application logic will be greatly affected.

For order splitting, the first thing you think of is split according to the user ID, the conclusion is correct, but it is better to speak the data, can not shoot the head. A good practice is to collect all the SQL first, pick the filter field that most often appears in the where statement, such as the user ID/PO ID and Merchant IDs, each of which has three cases in SQL:

    • Single ID filtering, such as user id=?
    • Multi-ID filtering, such as User ID in (?,?,?)
    • The ID does not appear

Then further statistics, assuming a total of 500 SQL access to the order library, 3 filter fields appear as follows:

The conclusion is obvious that the user ID should be chosen for the sub-Library.

Wait, this is just static analysis, each SQL access is different, so you also have to analyze the number of accesses per SQL. We analyzed the SQL that TOP15 executed the most (they accounted for the total number of executions 85%), and if they were divided by user ID, those SQL 85% fell to a single database, 13% to multiple databases, and only 2% needed to traverse all the databases, significantly better than using other IDs for the library.

Through quantitative analysis, we know that according to the user ID sub-Library is optimal, but also roughly know the impact of the existing system, such as in this example, 85% of SQL will fall into a single database, this part of the performance of the access will be optimized, to strengthen the confidence of the parties to the library.

Sub-Library strategy

After the library dimension is determined, how can the records be divided into various libraries? There are generally two ways:

    • Depending on the range of values, for example, a record with a user ID of 1-9999 is divided into the first library, 10000-20000 is divided into the second library, and so on.
    • Based on the numerical modulus, such as user ID mod n, a record with a remainder of 0 is placed in the first library, the remainder is 1 to the second library, and so on.

The pros and cons of the two methods are compared as follows:

In practice, in order to deal with the simple, choose MoD sub-Library more. At the same time two times the library, in order to facilitate data migration, is generally increased by multiples, such as the initial 4 libraries, two split into 8, and then 16. So for a library of data, half of the data moved to the new library, the remaining fixed, compared to only one library at a time, all the data will be large-scale changes.

In addition, the MoD library generally has a more uniform number of records per library, but there are some databases, there are super IDs, the records of these IDs far exceed other IDs, such as in the ad scene, a large advertiser may occupy a large proportion of the total. If you take a modular library by The Advertiser ID, there are a number of records for some libraries, and for these super IDs you need to provide a separate library to store the records.

Number of libraries

The number of libraries is first related to the number of records that can be processed by the library, in general, Mysql single library more than 50 million records, Oracle Single library more than 100 million records, DB pressure is very large (of course, the processing capacity and number of fields/access mode/record length has a further relationship).

In order to meet the above premise, if the number of sub-libraries is less than the purpose of decentralized storage and reduce the performance of the DB; if the number of libraries, the advantage is that each library record less, library access performance is good, but for access across multiple libraries, the application needs to access multiple libraries, if the concurrency mode, to consume valuable thread resources If it is in serial mode, the execution time will increase sharply.

The final number of libraries also directly affect the input of the hardware, generally each sub-library run on a separate physical machine, more than one library means more than one device. Therefore, the specific number of libraries, to comprehensive assessment, the general initial sub-library recommendations are divided into 4-8 libraries.

Route transparency

In a sense, the library means that the DB schema changes and inevitably affects the application, but this change is irrelevant to the business, so try to ensure that the library is transparent to the application code, and that the library logic is handled at the data access layer as much as possible. Of course it is difficult to do this completely, specifically which should be responsible for the DAL, which is responsible for the application, here are some suggestions:

1) For library access, such as specifying a user ID for a query condition, the SQL requires only access to a specific library. This should be automatically routed to a specific library by the DAL layer, and the application code will not be affected if the MoD factor is changed when the library is split two times.
2) for a simple multi-Library query, the DAL is responsible for summarizing the records returned by each database, and is still transparent to the upper application.
3) for multi-Library queries with aggregation operations, such as keywords with groupby/orderby/min/max/avg, it is recommended that the DAL summarize the results returned by a single library, and that the upper application is further processed. On the one hand the DAL fully supports a variety of cases, the implementation is very complex, on the other hand, from the practice of shop 1th, such examples are not many, in the upper application for targeted processing, more flexible.

The DAL can be further subdivided into JDBC and DAL Layer Two, based on the JDBC level to achieve the sub-library routing, system development is difficult, flexibility is low, there is no good success case; it is generally based on the persistence layer framework to further encapsulate the ddal (Distributed data access layer), to achieve the sub-Library routing, The 1th store Dal is based on the Ibatis for the upper package.

Pagination processing

After the library, some paging queries need to traverse all of the libraries, which are the largest victims of the repository.

To give a page example, such as the requirement to show the order of a business in chronological order, 100 records per page, because it is by business query, need to traverse all the database, assuming the number of libraries is 8, we look at the paging processing logic:

1) If you take the 1th page of data, you need to take the first 100 records from each library in chronological order, 8 libraries are summarized after 800, then the 800 records in the application two order, and finally take the first 100.
2) If you take the 10th page of data, you need to take the first 1000 (100*10) records from each library, the summary has 8,000 records, and then the 8,000 records two times after sorting (900,1000) records.

In the case of the sub-Library, for the K-page records, each library to take 100* (K-1) records, all the libraries add up, more records more, so the more the next paging, the system consumes more memory and execution time.
Compared to the case of no library, whichever page you take, it is easy to fetch 100 records from a single db without having to do two sorting within the application.

How to solve the problem of paging in the case of the sub-Library? There are several ways to do this:

1) If it is to provide paging in the foreground application, then limit the user can only see the front n page, this restriction is also reasonable in business, generally look at the back of the page is not very meaningful (if you must see, you can ask users to narrow the scope of re-query).
2) If the Background batch processing task requires data acquisition in batches, you can increase the page size, such as 5,000 records each time to effectively reduce the number of pages (of course, offline access to the general library, to avoid the impact of the main library).
3) Sub-Library design, there are generally supporting big data platform to summarize all the sub-database records, some paging query can consider the big data platform.

Lookup Mappings

There is only one library field, such as the user ID, but the order table has other fields that uniquely distinguish the record, such as the order ID, given an order ID, and the corresponding record must be in a library. If you blindly query all the sub-libraries, it brings unnecessary overhead, and the lookup map can find the corresponding user ID based on the order ID, thus realizing the single-Library positioning.

All order IDs and user IDs can be retrieved in advance, saved in the lookup table, the lookup tables have the same number of records and the total number of order library records, but it has only 2 fields, so neither storage nor query performance is a problem. When actually used, the lookup performance is generally optimized through distributed caching. For new orders, write a lookup table in addition to the order form.

Overall architecture

Shop number 1th The overall technical architecture of the order level sub-Library is as follows:

1) The upper-level application accesses the database through the order service/Sub-Library agent and DAL.
2) The agent is transparent to the order service, including aggregation operation, non-user ID to User ID mapping.
3) The lookup table is used for order ID/user ID mapping, guaranteed to be accessed by order IDs, can directly fall into a single library, the cache is the memory data image of lookup, improve performance, cache failure, direct access to the lookup table.
4) The DAL provides a route to the library, navigates to a library based on the user ID, and for docu access, the DAL supports optional concurrent access patterns and supports simple record summarization.
5) The lookup table initialization data is derived from the existing library data and is written directly by the agent when new records are added.

On-Line steps

Order form is the core business table, its horizontal split affect a lot of business, its own technical transformation is also very large, it is easy to make a mistake, on-line, must be carefully considered, shop 1th the entire program implementation process is as follows:

First, the Oracle and MySQL two sets of libraries are parallel, all data access points to the Oracle library, and data synchronizer to split the data from Oracle to multiple MySQL sub-libraries, such as 3-minute incremental synchronization.

1) Set up the whole system according to the above structure diagram, select several access examples with low real-time data (such as access to historical orders), turn to MySQL sub-library access, and then gradually add more non-real-time case to test the whole system feasibility.
2) If there is no problem with performance and functionality, turn all real-time read-write access to MySQL and discard Oracle at once.

This on-line step is the development of data Synchronizer (about 1 People weekly workload, the risk is very low), but dispersed the risk, the first step of the technical risk (Lookup/dal infrastructure transformation) and the second step of the business function risk (Oracle change MySQL syntax) separate. Shop 1th Two-stage on-line is a one-time success, especially the second stage on-line, more than 100 relying party application simple restart to complete the upgrade, there is no big problem in the middle.

Project Summary

Shop 1th to complete the order of the level of sub-library, the order from the Oralce to MySQL, the device from the minicomputer to X86 server, through the level of the library and the IoE, not only support the order volume of future growth, and the overall cost is also significantly reduced.

Due to the implementation of the IOE and the order sub-Library, resulting in a double performance impact, we spent a lot of effort to do performance testing, in order to simulate the real scene, we tcpcopy the actual query traffic to the test environment, after 13 rounds of performance testing, the final 6 MySQL library relative to an Oracle , the average SQL execution time is basically flat and performance is not degraded, optimizing the architecture and saving costs.

Before you do a horizontal sub-library of the core table, you must first do the service, that is, the external system through the unified Order service access to related tables, otherwise it is easy to omit some SQL.

Shop 1th is finally based on the user ID after the three-bit modulo, the initial 6 libraries, theoretically support up to 768 libraries, and the order ID generation rules have been modified to include the user ID after three bits, so that the new Order ID itself contains the information needed to locate the library, without the lookup mechanism, with the old Order archive to the History library, The lookup section in the above architecture can be deprecated.

The level of the library is a systematic work, first of all, under the guidance of the theoretical model, the actual situation, each aspect to make the best choice. Second, for special scenes, such as cross-library paging, there is no silver bullet, you can flexibly handle, do not take the conventional road. Finally, control the rhythm, system transformation, data migration, on-line implementation of all links to do a good job, global chess.

Bold design, careful verification, careful implementation, sub-library is not difficult.

Deep interpretation of the horizontal sub-Library & de-IoE under massive data storage and access

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.