Integrity and consistency problems caused by database/table sharding

Source: Internet
Author: User
Tags: database sharding

In a recent project, the amount of data computed every day was too large for a single database, so the data had to be split across multiple databases. Once the data is distributed across several databases, update operations run into consistency and integrity problems. The following is a typical scenario:

Assume there are three physical databases and a file containing a batch of records. According to the sharding rules, the records in the file can be divided among the three databases. We need to guarantee that the records are stored completely across the three databases and that the data is consistent, that is, the data imported into the three databases is exactly the same as the data in the file.
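As a minimal illustration of such a sharding rule (the article does not specify the rule, so the modulo-on-record-ID mapping below is an assumption), a record could be routed to one of the three databases like this:

// A minimal sketch of a sharding rule; the article does not define the rule,
// so the modulo-on-record-ID mapping used here is an assumption.
public final class ShardingRule {

    private static final int DATABASE_COUNT = 3;

    // Returns the index (0, 1 or 2) of the physical database a record belongs to.
    public static int databaseIndexFor(long recordId) {
        return (int) (recordId % DATABASE_COUNT);
    }

    public static void main(String[] args) {
        System.out.println(databaseIndexFor(10001L)); // prints 2
        System.out.println(databaseIndexFor(10002L)); // prints 0
    }
}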

Under normal circumstances, we first split the data in the file into three parts according to the database each record belongs to, and then save each part into its database. Within a single database, the integrity of that database's data can be guaranteed. Guaranteeing consistency across the three databases, however, is a very complicated task: it is quite possible that the data for the first database is saved successfully while the data for the other two databases fails to save, leaving the file only partially stored in the databases.

There are several ways to solve this problem:

Solution 1

Use a distributed transaction mechanism such as the one provided by JTA; this requires that the databases involved supply an XA driver (XA is the distributed transaction processing specification defined by X/Open). This is a straightforward solution that depends on specific database vendors: the complex transaction management is handled by the JTA provider and the XA-capable database driver. Most application server vendors implement JTA, for example JBoss, and there are open-source implementations such as JOTM (Java Open Transaction Manager) from ObjectWeb. However, introducing database drivers that support XA brings many potential problems. As Java Transaction Design Strategies puts it: in Java transaction management it is often confusing when XA should and should not be used; because most commercial application servers perform a one-phase commit optimization, performance degradation is not the main concern, but introducing an XA database driver into your application may lead to unpredictable consequences and errors, especially when the local transaction model is used, so when you do not need XA, try to avoid it. In short, this is an available option and the simplest one.
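For reference, a file import under this solution would look roughly like the sketch below. It is only a sketch: it assumes the application runs in a container that exposes a UserTransaction through JNDI and that the three data sources are XA-capable; the JNDI names, the payload column of data_detail, and the batch-splitting logic are assumptions, not details from the article.

import java.sql.Connection;
import java.sql.PreparedStatement;
import javax.naming.InitialContext;
import javax.sql.DataSource;
import javax.transaction.UserTransaction;

// Sketch of solution 1: one JTA transaction spanning all three shards.
// Assumes an application server that binds a UserTransaction and three
// XA-capable data sources in JNDI; names and columns are assumptions.
public class XaImportSketch {

    public void importRecords(String[] shardJndiNames, String[][] recordsPerShard) throws Exception {
        InitialContext ctx = new InitialContext();
        UserTransaction tx = (UserTransaction) ctx.lookup("java:comp/UserTransaction");
        tx.begin();
        try {
            for (int i = 0; i < shardJndiNames.length; i++) {
                DataSource ds = (DataSource) ctx.lookup(shardJndiNames[i]);
                try (Connection con = ds.getConnection();
                     PreparedStatement ps = con.prepareStatement(
                             "insert into data_detail (payload) values (?)")) {
                    for (String record : recordsPerShard[i]) {
                        ps.setString(1, record);
                        ps.addBatch();
                    }
                    ps.executeBatch();
                }
            }
            tx.commit();   // either every shard sees its data ...
        } catch (Exception e) {
            tx.rollback(); // ... or none of them does
            throw e;
        }
    }
}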

 

Solution 2

Create a file batch table (placed in an independent database) that stores batch information about the files to be processed, not the detailed data. Put simply, each time a file is processed, its name and path are inserted into this table with the status set to an initial state; after all the data in the file has been saved successfully to the three sub-databases, the file's batch status is updated to successful. If an exception occurs while saving to the sub-databases, the file batch stays in the initial state. A background timer regularly scans the batch statuses and re-executes the import for any batch still in the initial state, until the file is fully imported. This seems workable, but it can lead to duplicate imports: for example, if the batch is imported successfully into the first sub-database and fails on the other two, the retry may import the data into the first sub-database again. We could check before each insert whether the data has already been imported and skip it if so, but in extreme cases we cannot reliably tell whether it has been imported, so the solution is still flawed, and checking before every import also hurts performance. We could instead use an error recovery mechanism: when a file import fails, delete whatever records were already imported. But this just pushes the consistency problem into the error handling; for example, if two sub-databases have already been imported successfully and the third fails, the data in the first two should be deleted, and there is no guarantee that those deletions themselves are carried out consistently.
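The core of this solution is a status-driven retry loop. Below is a minimal sketch of that loop, assuming JDBC access to the independent batch database; the DataSource wiring, the file_name column, and the 30-second interval are assumptions.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.sql.DataSource;

// Sketch of solution 2: a timer rescans file batches still in the initial
// state 'I' and re-runs the import until it eventually succeeds.
public class FileBatchRetrySketch {

    private final DataSource batchDb;   // the independent database holding the batch table
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public FileBatchRetrySketch(DataSource batchDb) {
        this.batchDb = batchDb;
    }

    public void start() {
        timer.scheduleWithFixedDelay(this::rescan, 30, 30, TimeUnit.SECONDS);
    }

    private void rescan() {
        try (Connection con = batchDb.getConnection();
             PreparedStatement ps = con.prepareStatement(
                     "select file_name from file_batch_info where status = 'I'");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                importFile(rs.getString("file_name"));   // may touch some shards a second time
            }
        } catch (Exception e) {
            e.printStackTrace();   // leave the batch in state 'I'; the next scan retries it
        }
    }

    private void importFile(String fileName) {
        // split the file, write each part to its shard, then set status = 'S';
        // as discussed above, this step must somehow guard against duplicate imports
    }
}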

This solution can be optimized so that it looks workable. First, create a sub-batch table in the same database as the file batch table. During processing, the large file is split into three sub-files according to the sharding rules, each sub-file corresponding to one sub-database, which produces three sub-batch records. Because the file batch record and the sub-batch records are in the same database, their consistency can be guaranteed. Each file to be processed therefore yields four records: one master file batch record and three sub-batch records, all in the initial state before the import starts. Importing a file then means importing three sub-files into their corresponding sub-databases. For each sub-file, all of its data goes into a single database, so the consistency and integrity of that sub-file's data can be guaranteed; after a sub-file is imported successfully, its sub-batch status is updated to successful, and once all sub-batches are successful, the file batch status is updated to successful. This looks perfect, but there is a small detail: the sub-batch records live in an independent database, and the data being imported is generally not in the same database as the sub-batch records, so there is no way to make the import and the sub-batch status update consistent. In other words, the data of a sub-file may be imported into its sub-database successfully while the update of its sub-batch status fails. Could we store the sub-batch records in each sub-database instead? Then we are back to the problem raised at the beginning: the split step itself would have to write records consistently across several databases (think it through for yourself).

The following figure (omitted here) shows the design concept of the file batch table and the sub-batch table.

Solution 3

We can continue optimizing on the basis of the second solution. First, we keep the file batch table and the sub-file batch table of the second solution, and we place both tables in the same database (let us say they are allocated to the master database), which guarantees consistency when the task is split. Then, in each sub-database, we create a sub-file batch table whose model is basically the same as the sub-file batch table in the master database. When a task is split, the integrity of the master database's data is guaranteed first, that is, one file batch record and three sub-file batch records are written together. Next, each of the three sub-file batch records is copied to the sub-file batch table of its corresponding sub-database, and the sub-file batch record in the master database is then updated to "synchronized". Of course, this copy step alone cannot guarantee consistency, so the solution starts a scheduled task that regularly takes the sub-file batch records of the master database that are still in the initial state and synchronizes them to the sub-file batch tables of the corresponding sub-databases. This leads to two situations:

1. The sub-database's sub-batch table already contains the same record (which can be enforced by the unique primary key). This means the sub-file batch record of the master database was synchronized earlier, so its status in the master database is simply changed to "synchronized".

2. The sub-batch record does not yet exist in the sub-database. Insert the record into the corresponding sub-database, and then update the sub-file batch record in the master database to "synchronized".

After that, each sub-database first imports its sub-file and then updates its own sub-file batch record to "processed successfully". Both operations are performed inside the same sub-database, so their consistency is guaranteed. Finally, the status of the corresponding sub-file batch record in the master database is updated to "processed successfully". If that last status update fails, a similar timing mechanism can be used to reconcile the sub-file batch tables of the sub-databases with the sub-file batch table of the master database. This retry mechanism keeps the sub-file batch records in the master database consistent with those in the sub-databases. When every sub-file batch record of a file in the master database has been updated to "processed successfully", the file's batch status is updated to "processed successfully".

Compared with the second solution, we have added a data synchronization step between the master database and the sub-databases, and it is this mechanism that keeps the data in the master database and the sub-databases consistent.
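A minimal sketch of that extra synchronization step is shown below: a timer pushes every sub-batch record that is still in the initial state from the master database to its sub-database, relying on the unique primary key to stay idempotent, and then marks the record "synchronized" in the master database. The JDBC wiring, the target_db column used to pick the sub-database, and the duplicate-key detection are assumptions; the tables and status values are the ones used in the scripts further below.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLIntegrityConstraintViolationException;
import java.util.Map;
import javax.sql.DataSource;

// Sketch of the synchronization step added in solution 3: copy sub-batch
// records that are still 'I' in the master database into the matching
// sub-database, then mark them 'R' (synchronized) in the master database.
public class SubBatchSyncTask implements Runnable {

    private final DataSource mainDb;
    private final Map<String, DataSource> subDbs;   // e.g. "sub1" -> its DataSource

    public SubBatchSyncTask(DataSource mainDb, Map<String, DataSource> subDbs) {
        this.mainDb = mainDb;
        this.subDbs = subDbs;
    }

    @Override
    public void run() {
        try (Connection main = mainDb.getConnection();
             PreparedStatement query = main.prepareStatement(
                     "select id, file_name, main_batch_no, sub_batch_no, target_db "
                   + "from sub_batch_info where status = 'I'");
             ResultSet rs = query.executeQuery()) {
            while (rs.next()) {
                copyToSubDatabase(rs);
                markSynchronized(main, rs.getLong("id"));
            }
        } catch (Exception e) {
            e.printStackTrace();   // whatever was not synchronized is retried on the next run
        }
    }

    private void copyToSubDatabase(ResultSet row) throws Exception {
        DataSource target = subDbs.get(row.getString("target_db"));
        try (Connection sub = target.getConnection();
             PreparedStatement insert = sub.prepareStatement(
                     "insert into sub_batch_info (id, file_name, main_batch_no, sub_batch_no, status) "
                   + "values (?, ?, ?, ?, 'I')")) {
            insert.setLong(1, row.getLong("id"));
            insert.setString(2, row.getString("file_name"));
            insert.setString(3, row.getString("main_batch_no"));
            insert.setString(4, row.getString("sub_batch_no"));
            insert.executeUpdate();
        } catch (SQLIntegrityConstraintViolationException alreadyThere) {
            // situation 1 above: the record was synchronized earlier and only the
            // master's status update was missed -- nothing more to do in the sub-database
        }
    }

    private void markSynchronized(Connection main, long id) throws Exception {
        try (PreparedStatement update = main.prepareStatement(
                "update sub_batch_info set status = 'R' where id = ?")) {
            update.setLong(1, id);
            update.executeUpdate();
        }
    }
}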

Here is a brief look at the implementation details of the third solution:

The first part is the table structure in each database and the associations between them (the diagram is omitted here).

The following is a simple script to demonstrate this process.

Assume there are four databases: one main database and three sub-databases, sub1, sub2, and sub3.

Two tables in the main database:

file_batch_info: the key field is status, which moves from I (initial) to S (successful).

sub_batch_info: the key field is status, which moves from I (initial) to R (synchronized successfully) to S (processed successfully).

Two tables in each sub-database:

data_detail: stores the detailed records; this is the main table used by the business logic.

sub_batch_info: the key field is status, which moves from I (initial) to S (processed successfully).

1. Process of splitting file batches
begin
  declare file_name, batch_no, sub_batch_no;

  -- one master batch record for the file
  insert into main.file_batch_info (id, file_name, batch_no, status)
    values (seq.file_batch_info, #file_name#, #batch_no#, 'I');

  -- one sub-batch record per sub-database (three in total)
  insert into main.sub_batch_info (id, file_name, main_batch_no, sub_batch_no, status)
    values (seq.sub_batch_info, #file_name#, #batch_no#, #sub_batch_no#, 'I');
  insert into main.sub_batch_info (id, file_name, main_batch_no, sub_batch_no, status)
    values (seq.sub_batch_info, #file_name#, #batch_no#, #sub_batch_no#, 'I');
  insert into main.sub_batch_info (id, file_name, main_batch_no, sub_batch_no, status)
    values (seq.sub_batch_info, #file_name#, #batch_no#, #sub_batch_no#, 'I');

  commit;
end;

2. Synchronize the sub-batch records of the main database to the sub-batch table of each corresponding sub-database. After a record has been synchronized successfully, update the status of the corresponding sub-batch record in the main database to "synchronized".

# Sub-database operation: take the corresponding record from the main database's
# sub_batch_info table and insert it into the sub1 database.
begin transaction in sub1
  declare sub_id, file_name, batch_no, sub_batch_no;

  -- if the record already exists in sub1 (guaranteed by the unique primary key),
  -- it was synchronized earlier; simply report success
  if exists (select * from sub1.sub_batch_info where id = sub_id)
    return success;

  insert into sub1.sub_batch_info (id, file_name, main_batch_no, sub_batch_no, status)
    values (sub_id, #file_name#, #batch_no#, #sub_batch_no#, 'I');
  commit;
end;

# After the sub1 operation completes, update the corresponding record
# in the main database's sub_batch_info table.
begin transaction in main
  declare sub_id;
  -- 'R' marks the record as synchronized; the current status could be checked
  -- here first, but doing so adds little
  update main.sub_batch_info set status = 'R' where id = sub_id;
  commit;
end;

The above shows the operation for a single sub-database; with multiple sub-databases the operation is simply repeated for each one. If synchronization to some sub-database fails, the timed recovery mechanism takes over: its SQL extracts the sub_batch_info records that are still in the initial state from the main database and repeats the process above.

 

3. Each sub-database processes its sub-batch: it saves the detailed records and updates its own sub_batch_info record to the processed-successfully state. Then the status of the corresponding sub_batch_info record in the main database is updated to processed successfully.

 

# Sub-database detail processing
begin transaction in sub1
  declare sub_id, status;

  select status into status from sub1.sub_batch_info where id = sub_id;

  -- only process batches that are still in the initial state
  if status = 'I'
    insert into sub1.data_detail ( ... the detailed records of this sub-file ... );
    update sub1.sub_batch_info set status = 'S' where id = sub_id;
  end if;
  commit;
end;

# After the sub1 operation completes, update the corresponding record
# in the main database's sub_batch_info table.
begin transaction in main
  declare sub_id;
  -- 'S' marks the sub-batch as processed successfully; the current status could
  -- be checked here first, but doing so adds little
  update main.sub_batch_info set status = 'S' where id = sub_id;
  commit;
end;

The same issue exists here: the statuses in the sub-database and the main database can get out of step, so a timed task is again needed to bring the status of the main database's sub_batch_info table up to date.

 

4. Check whether every sub_batch_info record of the file in the main database has the processed-successfully status. If so, update the status of the file's file_batch_info record in the main database to processed successfully.
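A minimal sketch of this final check, assuming JDBC access to the main database and the table names used above (the main_batch_no and batch_no columns used for matching are assumptions):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.sql.DataSource;

// Sketch of step 4: once no sub-batch of a file batch is left in a
// non-successful state, mark the whole file batch as processed successfully.
public class FileBatchCompletionCheck {

    private final DataSource mainDb;

    public FileBatchCompletionCheck(DataSource mainDb) {
        this.mainDb = mainDb;
    }

    public void check(String batchNo) throws Exception {
        try (Connection main = mainDb.getConnection()) {
            try (PreparedStatement count = main.prepareStatement(
                    "select count(*) from sub_batch_info where main_batch_no = ? and status <> 'S'")) {
                count.setString(1, batchNo);
                try (ResultSet rs = count.executeQuery()) {
                    rs.next();
                    if (rs.getInt(1) > 0) {
                        return;   // at least one sub-batch is still pending
                    }
                }
            }
            try (PreparedStatement update = main.prepareStatement(
                    "update file_batch_info set status = 'S' where batch_no = ?")) {
                update.setString(1, batchNo);
                update.executeUpdate();
            }
        }
    }
}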

 

These four steps require three timers in total: two of them keep the statuses in the main database and the sub-databases consistent, and the third asynchronously reconciles the file batch status in the main database with the statuses of its sub-batches.

 

From the third solution we can extract general logic to handle similar scenarios in the future, for example deleting the records that meet a condition in every sub-database, or batch-updating the records that meet a condition in every sub-database. We can abstract such an operation as a task: a task consists of n sub-tasks, where n is the number of sub-databases, and the system must ensure that the n sub-tasks either all succeed or all fail, never only partially succeed. Following solution 3, we can create a master task table and a sub-task table; the file import process is then just one task type, and batch deletion, batch update, and other similar operations become further task types (see the sketch below).
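One way to express that abstraction is sketched below; the interface and type names are assumptions rather than something taken from the article.

import java.util.List;

// Sketch of the generalized task model: a master task fans out into one
// sub-task per sub-database, and the task succeeds only when every sub-task
// has succeeded. File import, batch delete and batch update would each be
// one task type. All names here are assumptions.
public interface ShardTask {

    enum Status { INITIAL, SYNCHRONIZED, SUCCEEDED }

    interface SubTask {
        String subDatabase();              // the sub-database this sub-task targets
        Status status();
        void execute() throws Exception;   // must be idempotent: the retry timers may rerun it
    }

    List<SubTask> split();                 // one sub-task per sub-database

    default boolean succeeded() {
        return split().stream().allMatch(t -> t.status() == Status.SUCCEEDED);
    }
}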

 

Solution 4

The fourth solution applies the two-phase commit idea from classic distributed transaction design. Two-phase commit involves three important sub-operations: prepare, commit, and rollback.

Continuing with file import as the example, each sub-database acts as a transaction participant, and we need to design its prepare, commit, and rollback operations.

Prepare phase: each sub-database saves the file details it has to process into a temporary table and records the context information of this transaction.

Commit phase: the temporary-table data belonging to this transaction context is moved into the corresponding formal detail table.

Rollback phase: the temporary-table records belonging to this transaction are deleted.

By designing a two-phase-commit transaction manager, we can start a distributed transaction for the file import and generate a transaction context (the context information should be saved to the database). Each participant is then called with this context and performs its prepare operation in its sub-database (that is, it saves the details into the temporary table); if that succeeds, the participant has prepared successfully. When all participants have prepared successfully, the transaction manager commits the transaction, and each sub-database completes the commit action by inserting the data from its temporary table into the formal table. If any prepare operation fails, all sub-databases perform the rollback operation and delete the records that were written.

The crucial point is that if a sub-database reports success in the prepare phase, its commit must succeed; otherwise only data correction or manual processing remains, because two-phase commit itself has no answer for that case.

Different operations each need their own prepare, commit, and rollback implementations, so the development workload is large, and implementing the distributed transaction manager itself also requires some expertise.
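A minimal sketch of the participant contract and the coordinator loop described above is given below. All names are assumptions, and a real two-phase-commit manager would also have to persist the transaction context so that it can recover after a crash.

import java.util.List;

// Sketch of solution 4: each sub-database implements prepare/commit/rollback
// (for file import: write the details into a temporary table, move them into
// the formal table, or delete them), and a coordinator drives the two phases.
public class TwoPhaseCommitSketch {

    public interface Participant {
        void prepare(String txId) throws Exception;    // save the details into a temporary table
        void commit(String txId) throws Exception;     // move the temporary rows into the formal table
        void rollback(String txId) throws Exception;   // delete the temporary rows of this transaction
    }

    public void execute(String txId, List<Participant> shards) throws Exception {
        // phase 1: every shard prepares; any failure aborts the whole transaction
        for (Participant shard : shards) {
            try {
                shard.prepare(txId);
            } catch (Exception prepareFailed) {
                for (Participant p : shards) {
                    try { p.rollback(txId); } catch (Exception ignored) { }
                }
                throw prepareFailed;
            }
        }
        // phase 2: every shard commits; as noted above, a shard that prepared
        // successfully is expected to commit successfully, otherwise manual repair is needed
        for (Participant shard : shards) {
            shard.commit(txId);
        }
    }
}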

 

Of the four solutions above, only the third and the fourth can guarantee integrity and consistency, and they in fact share the same design idea: reach a correct state through retries and asynchronous reconciliation. Strictly speaking, the third solution still has problems, because over the whole process it can only guarantee eventual consistency; it cannot provide the isolation of ACID, and some data that has already been written may be rolled back later. The third solution can be improved by adding a locking mechanism, but that extension is complicated.
