Objective
When we talk about database schema and database optimization, we often hear "sub-database", "Shard", "sharding" ... Such a keyword. Happily, the number of companies that these friends are serving is (or is about to be) growing at a rapid pace, and technology faces some challenges.
What worries people is that they really need a "sub-database"? Is the "sub-database" So easy to practice? Therefore, the author has collated some problems which may be encountered in the sub-database, and has introduced the corresponding solution ideas and suggestions in the light of previous experience.
Vertical Sub-table
Vertical sub-table in daily development and design is more common, popular parlance is called "Large table Split table", Split is based on the relational database in the "column" (field). Typically, there are more fields in a table, you can create a new extension table, split a field that is not frequently used or has a large length, and place it in the extension table, as shown in:
Summary
In the case of many fields, splitting is really easier to develop and maintain (the author has seen a legacy system that contains more than 100 columns in a large table). In a sense can also avoid the "cross-page" problem (MySQL, MSSQL Bottom is stored through the "data page", "Cross-page" problem may cause additional performance overhead, not here, interested friends can consult the relevant data for research).
The operation of splitting a field is recommended in the database design phase. If it is split in the development process, the need to rewrite the previous query statements, additional costs and risks, and recommended caution.
Vertical Sub-Library
Vertical sub-Libraries in the "microservices" prevalent today is very popular. The basic idea is to classify different databases according to the business module, instead of putting all the data tables in the same database as in the early days. Such as:
Summary
The "service" split operation at the system level can solve the coupling and performance bottleneck at the business system level, which is advantageous to the extended maintenance of the system. And the database level of split, the truth is also interlinked. Similar to the "governance" and "downgrade" mechanisms of services, we are able to "rank" and manage, maintain, monitor, and scale data for different business types.
As we all know, the database is often the most easy to become the bottleneck of application system, and the database itself is "stateful", relative to the Web and application server, it is more difficult to achieve "scale-out". The connection resource of database is valuable and the processing ability of single machine is limited, in high concurrency scenario, the vertical sub-Library can break through the bottleneck of Io, connection number and hardware resource, and it is an important means to optimize database architecture in large-scale distributed system.
Then, a lot of people do not fundamentally understand why to split, also do not have the principle and skill of splitting, just blindly imitate the practice of the manufacturers. Causes many problems after splitting (for example: cross-Library join, distributed transaction, etc.).
Horizontal Sub-table
Horizontal sub-table is also called horizontal table, it is easier to understand, is the table of different rows of data in accordance with a certain pattern of distribution to different database tables (these tables are stored in the same database), so as to reduce the amount of single-table data, optimize query performance. The most common way is to use the primary key or time and other fields to hash and modulo after the split. As shown in the following:
Summary
The horizontal sub-table can reduce the data volume of single table, and can alleviate the query performance bottleneck to some extent. But in essence these tables are also stored in the same library, so the library level still has an IO bottleneck. Therefore, it is generally not recommended to adopt this approach.
Horizontal sub-Library sub-table
The horizontal sub-Library table is the same as the idea of the horizontal table mentioned above, and the only difference is that these split tables are stored in different data. This is what many large internet companies have chosen to do. Such as:
In a sense, some systems use "hot and cold Data separation" (migrating some of the less-used historical data to other databases. And in the business function, usually by default only provides the hot spot data query), also is similar practice.
In the scenario of high concurrency and massive data, the sub-database table can effectively alleviate the performance bottleneck and pressure of single machine and library, and break through the bottleneck of Io, connection number and hardware resource. Of course, the cost of hardware will also be higher. At the same time, this can lead to complex technical problems and challenges (e.g., complex queries across shards, cross-shard transactions, etc.)
The difficulty of the Sub-database table
1, the vertical sub-Library brings the problem and the solution idea
Issues with cross-Library joins
Before splitting, the data required for many lists and detail pages in the system can be done through SQL join. After splitting, the database may be distributed on different instances and on different hosts, and the join becomes cumbersome. It is generally forbidden to cross-Library joins based on architectural specifications, performance, and security considerations. What should we do then?
First of all to consider the vertical sub-library design problems, if you can adjust, then priority adjustment. If unable to adjust the situation, the following author will combine the previous practical experience, summed up several common solutions, and analysis of its application scenarios.
Several solutions to cross-Library join
The so-called global table is the possibility that all modules in the system may be dependent on some of the tables. Compare the "data dictionary" that we understand. To avoid cross-Library join queries, we can save one copy of such tables in each of the other databases. At the same time, this type of data is often rarely modified (or even almost impossible), so there is no need to worry too much about "consistency".
This is a typical anti-paradigm design that is more common in the Internet industry and is typically designed to avoid join queries for performance.
A very simple scenario for an e-commerce business:
The "Seller ID" is saved in the "order Form", and the seller's "Name" field is also redundant, so there is no need to query the "Seller user table" When querying the order details.
Field redundancy can bring convenience and is a kind of "space change Time" embodiment. But the application of the scenario is also relatively limited, more suitable for less dependent fields. The most complex is the data consistency issue, which can be difficult to guarantee by using triggers in the database or at the business code level.
Of course, there is a need for consistency in the context of a business scenario. As in the example above, if the seller modifies the name, does it need to synchronize the updates in the order information?
The Tab_a table in the time a library is associated with tbl_b in the B library and can be timed to synchronize the specified table. Of course, synchronization would have a certain impact on the database, requiring a balance between performance impact and data timeliness. This avoids complex cross-library queries. I have been in the project through the ETL tools to implement.
At the system level, data is fetched and field assembled by invoking components or services from different modules. It's easy to say, but it's not so easy to practice, especially when there are problems with database design, but it can't be easily adjusted.
The specifics are usually more complex. The following author combined with previous practical experience, and by pseudo-code way to describe.
A simple list of query cases
Pseudocode is easy to understand, first get the "My question list" data, and then according to the list of UserID to loop call the dependent User Service to get to the user's realname, assemble the results and return.
An experienced reader can see the efficiency of the appeal pseudo-code at a glance. Loop call service, there may be cyclic RPC, query database ... It is not recommended. Then look at the improved:
This implementation, it seems to be elegant, is actually to change the loop call to one call. Of course, the database query for user Services is likely to be in query, which is more efficient than the previous one. (The ubiquitous in query will be full table scan, there are performance problems, rumors are not all-faith.) In fact, the query optimizer is the basic cost estimate, after testing, in the in statement when the condition field is indexed, the condition of less is going to be indexed. Do not detail here to expand the instructions, interested friends please self-test).
Summary
In the case of simple field assembly, we just need to get the "primary table" data and then, depending on the correlation, call the components or services of other modules to get the other dependent fields (such as the user information dependent on the example) and finally assemble the data.
Typically, we use caching to avoid the overhead of frequent RPC traffic and database queries.
List queries with conditional filtering
In the above example, there is a simple field assembly, and no conditional filtering exists. Look at the SQL before splitting:
With this connection query and conditional filtering, it is very complicated to assemble the data at the code level (especially if the left table and right tables are conditionally filtered) and cannot be assembled as easily as in the previous example. Imagine that if you assemble as simple as above, the result is that the returned data is incomplete and inaccurate.
There are several ways to solve this problem:
Find all the question and answer data, then call the user Service to assemble the data, then filter according to the Filter field State field, then sort and pagination and return.
This approach guarantees the accuracy and completeness of the data, but the performance impact is very large and is not recommended.
Query the State field compliant/non-compliant userid, when querying the question and answer data using in/not in filtering, sorting, paging and so on. After filtering out valid question and answer data, call the user Service to get the data to assemble.
This approach is noticeably more elegant. I have previously implemented this approach in a particular scenario of a project.
Issues with cross-Library transactions (distributed transactions)
After splitting the database by business, the problem of "distributed transactions" is unavoidable. In the past, with the simple configuration of spring annotations in the code to achieve transactions, now it takes a great cost to ensure consistency. No introduction here, interested readers can refer to the Distributed Transaction consistency solution, link address:
Http://www.infoq.com/cn/articles/solution-of-distributed-system-transaction-consistency
Vertical Sub-Library summary and practical suggestions
This article mainly describes several common splitting methods, and emphatically introduces some problems and solutions of vertical sub-Library. Reader friends may have some questions and doubts.
1. Do we need a vertical sub-library for our current database?
According to the system architecture and the company's actual situation, if your system is still a simple monomer application, and there is no traffic and data volume, then do not worry about "vertical sub-Library", otherwise there is no profit, it is difficult to have good results.
Remember, "over-design" and "premature optimization" are a common problem for many architects and technicians.
2. Is there any principle or skill in vertical splitting?
There's no golden rule or standard answer. It is generally the business module splitting of the reference system to split the database. For example "User Service", the corresponding may be "user database". But it is not necessarily strictly one by one correspondence.
In some cases, the granularity of a database split may be thicker than the granularity of the system split. I have also indeed seen some of the system in some of the tables should be placed in a library, but placed in the B library. Some libraries and tables could have been merged, but saved separately. There are also some tables that appear to be OK in a library, and also reasonable in the B library.
How to design and trade-offs, this depends on the actual situation and the level of the architect/developer.
3. The above examples are too simple, our background report system joins the table has n, after the library how to check?
There are a lot of friends who have mentioned similar questions to me. In fact, the Internet business system, should try to avoid join, if there are multiple joins, either the design is unreasonable, or the wrong technology selection. Please do the OLAP and OLTP on your own, the report class system in the traditional bi era is through the OLAP data Warehouse to achieve (now more with the use of offline analysis, streaming computing and other means to achieve), and should not be described above directly in the business library to perform a large number of joins and statistics.
Several common forms of the Library sub-table and the problems that may be encountered