SHARDING-JDBC of Sub-database sub-table middleware

Source: Internet
Author: User

Database Library sub-table from the Internet era has been a hot topic. In today's NoSQL, relational databases are still preferred by most companies because of their stability, query flexibility, and compatibility. Therefore, it is an unavoidable problem for major Internet companies to use the sub-database technology to deal with the impact of massive data and high concurrency on the database.

While many companies are committed to developing their own sub-Library sub-table middleware, there is still no perfect open source solution to cover this area.

Sub-Library sub-table applicable scenarios

The Library sub-table is used to address the two common scenarios of the current Internet-large data volumes and high concurrency. It is usually divided into two types: vertical split and horizontal split.

Vertical splitting is the splitting of a library (table) into multiple libraries (tables) based on the business. For example, to split frequently and infrequently accessed fields into different libraries or tables. Due to the close relationship with the business, the current sub-database sub-table products are using horizontal splitting method.

Horizontal splitting is the splitting of a library (table) into multiple libraries (tables) based on the Shard algorithm. such as: According to the last digit of the ID to 3, the mantissa is 1 into the 1th Library (table), the mantissa is 2 into the 2nd Library (table) and so on.

Relational databases have a steep drop in retrieval performance when they are larger than a certain amount of data. In the face of the Internet massive data situation, all the data are stored in a table, it is clear that the database table can easily exceed the amount of data threshold. This single-table can withstand the data volume threshold, according to the database and the concurrency of the difference, through the actual test obtained.

Although the simple table can solve the problem of slow retrieval of data, it can't solve the problem that too many concurrent requests access the same library, causing the database to respond slowly. Therefore, the horizontal split is usually used at least in the form of a library to solve the problem of large data volume and high concurrency. This is also part of the open source Shard database middleware support only the reason for the sub-Library.

But the sub-table also has an irreplaceable application scenario. The most common table requirements are transaction issues. The same library does not need to consider distributed transactions, good at using the same library table can effectively avoid the problem of distributed transactions. The current strong consistency of distributed transactions due to performance problems, resulting in the use of not necessarily more than the library sub-table faster. At present, the ultimate consistency of flexible transactions mostly. Another reason for the existence of a sub-table is that too many database instances are not conducive to operations management. In summary, the best practice is to use the Sub-Library + sub-table reasonably.

SHARDING-JDBC Introduction

SHARDING-JDBC is a database horizontal shard framework, which is separated from the relational database module DD-RDB when applying framework Ddframe, and realizes the transparent database sub-table access. SHARDING-JDBC is the 3rd project of the Ddframe series open source following Dubbox and Elastic-job.

SHARDING-JDBC directly encapsulates the JDBC API, which can be understood as an enhanced version of the JDBC driver, and the legacy code migration cost is almost zero:

    • can be applied to any Java-based ORM framework, such as JPA, Hibernate, Mybatis, Spring jdbc template, or directly using JDBC.
    • Can be based on any third-party database connection pools, such as DBCP, C3P0, BONECP, Druid, and so on.
    • It is theoretically possible to support any database that implements the JDBC specification. Although only MySQL is currently supported, there are plans to support databases such as Oracle, SQL Server, and so on.

The SHARDING-JDBC is positioned as a lightweight Java framework that uses a direct-attached database of clients, provides services in the form of a jar package, no proxy layer, no additional deployment, no other dependencies, and no DBA needs to change the original operation mode.

The SHARDING-JDBC shard strategy is flexible and can support multi-dimensional shards such as equals, between, in, or multiple sharding keys.

SQL parsing function is perfect, support aggregation, grouping, sorting, limit, or and other queries, and support binding table and Cartesian product table query.

Compare with common open source products

In order to show respect for other open source projects, we have no intention of commenting on the items that are still being updated. Here are just a few items that are currently stopping the update, but are still very influential in the database Shard area, see table 1.

Table 1 Database Shard tool Comparison

As you can see from the table above, Cobar belongs to the middle tier scheme and builds a proxy between the application and MySQL. The middle tier between the application and the database, need to do a forwarding, and based on the JDBC protocol and no additional forwarding, directly from the application to connect to the database, performance has a slight advantage. This is not to show that the middle tier is not as direct as the client, in addition to performance, there are many factors to consider, the middle layer is more convenient for monitoring, data migration, connection management and other functions.

Cobar-client, TDDL, and SHARDING-JDBC are all part of the client direct connection program. The benefits of this scenario are portability, compatibility, performance, and minimal impact on DBAs. The implementation of Cobar-client is based on the ORM (Mybatis) framework, which has less compatibility and extensibility than the latter two based on the JDBC protocol.

Implementation principle

The SHARDING-JDBC is a jar file that implements the JDBC protocol, as described in the previous article. The implementation of JDBC-based protocol is slightly different from the middle tier based on MySQL and other database protocols.

Regardless of which architecture, the core logic is very similar, in addition to the protocol implementation layer (JDBC or database protocol), will be divided into the Shard rule configuration, SQL parsing, SQL rewrite, SQL routing, SQL execution and result merging modules.

The overall architecture of the SHARDING-JDBC is shown in Figure 1.

Figure 1 overall architecture diagram of SHARDING-JDBC

Shard Rule Configuration

Sharding-jdbc's Shard logic is very flexible and supports the functions of sharding policy customization, complex Shard key, and multi-operator sharding.

such as: According to the user ID sub-Library, according to the Order ID sub-table such a sub-library sub-table combination of sharding strategy, or according to the annual library, month + User area ID sub-table such as multi-chip key sharding.

SHARDING-JDBC, in addition to supporting the equal sign operator for sharding, also supports In/between operator sharding, which provides more powerful sharding functionality.

SHARDING-JDBC provides a spring namespace for simplified configuration, and a rule engine for simplifying policy authoring. As a result of the current open source Shard Core Logic, these two modules are not open source, waiting for the core stability will be open source other modules.

JDBC Specification rewrite

SHARDING-JDBC's approach to the JDBC specification is to encapsulate five core interfaces for DataSource, Connection, Statement, PreparedStatement, and ResultSet, Incorporating multiple real JDBC implementation class collections (e.g. MySQL JDBC implementation/dbcp JDBC implementation, etc.) into the SHARDING-JDBC implementation class management.

SHARDING-JDBC maximize the implementation of the JDBC protocol, including Addbatch, a batch update feature that will be used in JPA. However, after all, Shard JDBC is different from native JDBC, so there are still non-implemented interfaces, including connection cursors, stored procedures and savepoint-related, resultset forward traversal and modification, and other less commonly used functions. In addition, to ensure compatibility, JDBC 4.1 and its post-release interfaces are not implemented (e.g., JDBC 4.1 is not supported in the DBCP 1.x version).

SQL parsing

SQL parsing, as the core of the sub-list products, performance and compatibility are the most important metrics. At present, the common SQL parser mainly has fdb/jsqlparser and druid. SHARDING-JDBC uses Druid as the SQL parser, and the Druid parsing speed is dozens of times times that of the other two parsers after being tested.

Currently SHARDING-JDBC supports parsing of complex SQL such as join, aggregation (including AVG), order BY, group by, limit, or even query. The SQL parsing that is not expected to occur in the Shard scene is not supported by union, partial subqueries, intra-function shards, and so on.

SQL overwrite

The SQL rewrite is divided into two parts, in part by replacing the logical table name of the table with the real table name. The other part is to replace some of the incorrect functionality in the Shard environment based on the SQL parsing results. Here is a two example:

The 1th example is AVG calculation. In the context of a shard, it is incorrect to calculate the average value in Avg1 +AVG2+AVG3/3, which needs to be rewritten as (SUM1+SUM2+SUM3)/(count1+count2+ count3). This requires that the SQL that contains the AVG be overwritten with sum and count, and then the average is recalculated when the result is merged.

A 2nd example is paging. Let's say that every 10 data is one page and the 2nd page of data is taken. It is incorrect to obtain the limit 10, 10, and then remove the first 10 data according to the sorting criteria in the Shard environment. The correct approach is to rewrite the sub-conditions to limit 0, 20, remove all the first 2 pages of data, and then combine the sorting criteria to calculate the correct data. You can see that the higher the limit paging efficiency will be lower, but also the more wasted memory. There are a number of ways to avoid paging using limit, such as building a two-level index of record row and row offsets, or using the last-paged data end ID as the next query condition.

SQL routing

SQL routing is based on the Shard rule configuration, which locates SQL to a true data source. It is mainly divided into single-table routing, binding table routing and Cartesian product routing.

Single-table routing is the simplest, but routing results do not necessarily fall into the unique libraries (tables) because they support sharding based on operators such as between and in, so the end result can still fall into multiple libraries (tables).

The binding table can be understood as a master-slave table with fully consistent sub-table rules. Example: The order form and the Order Details table are based on the order ID as the Shard key, at any time the Shard logic is the same. Such association queries and single-table queries are quite difficult and performance-like.

The Cartesian product query is the most complex, because the consistency of the Shard rules cannot be located according to the binding relationship, so the associative query of the non-binding table needs to be disassembled for the Cartesian product combination execution. Query performance is low, and the number of database connections is high, you need to use caution.

SQL execution

After routing to the real data source, SHARDING-JDBC will execute SQL in multi-threaded concurrency and complete the processing of batch methods such as Addbatch.

Merge Results

Result merging consists of 4 classes: normal traversal classes, sort classes, aggregation classes, and grouping classes. Each type skips unwanted data first based on pagination results.

The common traversal class is the simplest, simply traversing the collection of resultset in sequence.

The sorting class result sorts the result first and then outputs, because each shard result is sorted according to the respective condition, so the merge sort algorithm is used to integrate the final result.

There are 3 types of aggregation classes, comparative, cumulative, and average. The comparison type includes Max and Min, returning only the maximum (small) result. The additive type includes sum and count, which need to be returned after the result is accumulated. The average value is calculated by the sum and Count of SQL overrides, which are covered in SQL rewriting and are not included in the details.

Grouping classes are the most complex, requiring all resultset results to be put into memory, grouped using the map-reduce algorithm, and related processing based on sorting and aggregation conditions. This is where most of the memory is consumed and the most loss of performance is, consider using limit to restrict the packet size reasonably.

As a result, the merge section does not use pipeline parsing at the moment, and more improvements will be made here.

Performance

Performance test report of routing results in library single table:

Query operation: TPS for SHARDING-JDBC is 99.8% of the TPS for JDBC, insert operation: SHARDING-JDBC TPS is 90.2% of the TPS of the JDBC; Update operation: The TPS for SHARDING-JDBC is 93.1% of the TPS for JDBC, and you can see that the SHARDING-JDBC performance loss is very low.

Performance test report for routing results in multi-Library multi-table:

Query operation: TPS dual Kubidanku can increase performance by approximately 94%; Insert operation: TPS dual Kubidanku can increase performance by approximately 60%; Update operation: TPS dual Kubidanku can increase performance by approximately 89%; the results show that SHARDING-JDBC can effectively utilize multi-threading and distributed resources to significantly improve performance, and more details to view SHARDING-JDBC performance test reports.

Roadmap

At present, SHARDING-JDBC focus on the core logic development of the sub-database, after the stable function will be continuously updated according to the following lines:

    • Read and write separation;
    • Flexible distributed transactions;
    • Distributed primary key generation strategy;
    • SQL rewrite optimization to further improve performance;
    • SQL Hint, which specifies that a SQL is executed in a specific library table, based on business rules rather than SQL parsing routes;
    • Ha-related;
    • Flow control;
    • Database building table Tools;
    • Data migration;
    • Complex SQL parsing support, such as sub-query, stored procedures, etc.;
    • Oracle, SQL Server support;
    • Configuration Center;
Open Source Concept

At present, many open source products in the company have withstood the test of time, and then stripped business logic and sensitive code, and then open source contribution to the community. The advantage of this is that open source products are relatively mature. But the shortcomings are unavoidable, mainly:

    1. Lack of follow-up support. The product has met the company's business scenario needs, the lack of follow-up momentum. Documents and support are also relatively small, even in situations where documents and code are out of sync.
    2. Coupled with the company's business scenario is more serious. Most of the framework products are designed to solve specific problems. For example, some companies may not need a table, and some companies only need to support several sharding strategies.
    3. Open source is incomplete. Parts that are tightly coupled with the company's business will not open source.
    4. Lack of viscosity. The more-formed project is difficult to expand or modify the core logic because of its many functions and complex code structure. If the test coverage is not enough, it is difficult to guarantee the modified code quality. The above series of problems will lead to the low viscosity of the project to the community, it is difficult to find a cooperative development of volunteers.
    5. Many branches are difficult to maintain. Since open source after the company lacks the power of continuous improvement, and the company has little relation to the needs of the function is not paid attention, leading to the development of the company's own branch. Although the open source project initially injected fresh ideas into the community, it did not ultimately absorb the essence of the community. such as: Dubbo The emergence of a considerable amount of attention, and companies have their own version, such as when the Dubbox, but ultimately Dubbo and not sustainable development.

We consider a new open source strategy that will be promoted both to the community and when the SHARDING-JDBC has just finished its first edition. The benefits of doing this are:

    • Follow-up support is complete. Sharding-jdbc with when the internal ground is bound, will be when the internal and the community to provide support at the same time. While it is not possible to provide community demand priority over when internal commitments are made, we will take into consideration the community and internal needs, with a higher perspective, as much as possible to integrate and optimize the upgrade route.
    • Full open source. The snapshot version of the code will appear first on GitHub.
    • Common development. SHARDING-JDBC current code is simpler. Make it easier for community open-source enthusiasts to understand the code core and lay the groundwork for future sustainable development. And SHARDING-JDBC also absorbs the essence of the community, allowing more enthusiasts to participate in the code contribution.

Finally, it is necessary to clarify that SHARDING-JDBC is not a bug pile, completely unavailable project. The current test coverage is over 90%, the detailed features and unsupported items are clearly listed in GitHub's documentation, hoping to keep users in the clear.

Sub-Library sub-table middleware Sharding-jdbc

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.