Completeness and consistency issues with sub-database (sharded) tables


In a recent project, the database was split into sub-databases (sharded) because of the large volume of data processed daily. Once data is dispersed across multiple databases, update operations face consistency and integrity problems. The following is a typical scenario.

Suppose there are three physical databases, and a file containing 10,000 (1W) records arrives. According to the sharding rules, the records in the file are divided among the three databases. We must guarantee that all 10,000 records are stored across the three databases and that the data is consistent, that is, the data imported into the three databases is exactly the same as the data in the file.

Under normal circumstances, we first split the data in the file into three parts according to the sharding rules, then save each part into its database. Within a single database, a local transaction can guarantee the integrity of that part. However, guaranteeing consistency across all three databases is very complex: the save into the first database may succeed while a save into one of the remaining databases fails, leaving the file's data incomplete in the databases.

How can this problem be solved? Several approaches come to mind:

Scenario 1

Use a distributed transaction mechanism such as JTA, which requires the databases involved to provide XA-compliant drivers (XA is the distributed transaction processing specification proposed by the X/Open group). This depends on specific database vendors, but it is also a relatively simple option: the complex transaction management is handled by the application server providing JTA and the database vendors providing XA drivers. Many server vendors have implemented JTA, for example JBoss, and there are open-source JTA implementations such as JOTM (Java Open Transaction Manager) from ObjectWeb. However, introducing XA-enabled database drivers brings a number of potential problems. As Java Transaction Design Strategies notes, a frequently confusing question in Java transaction management is when to use XA and when not to. Because most commercial application servers optimize to a single-phase commit (one-phase commit) where possible, performance degradation is not necessarily the main concern; but introducing XA database drivers into an application that does not need them can lead to unpredictable consequences and errors, especially under the local transaction model. Therefore, it is generally best to avoid XA when you do not need it. So this is an optional solution, and the simplest one.

Scenario 2

Create a file batch table (placed in a separate database) that records the file batches to be processed (not the detail data, just the file name and path). Each time a file is processed, insert a batch record into this table with its status set to the initial state; after all the data in the file has been successfully saved into the three sub-databases, update the batch status to success. If an exception occurs while saving into any sub-database, the file batch stays in the initial state. A background timer periodically scans the batch statuses; whenever it finds a batch in the initial state, it re-executes the import of that file, until the file is imported completely and successfully.

This scenario looks fine at first, but it can produce duplicate imports: for example, if the import into the first sub-database succeeded and the other two failed, the retry may import the same data into the first sub-database again. We could check before each import whether the data has already been imported and skip it if so, but in extreme cases we cannot reliably determine this, so the scheme is flawed; and checking the data before every import also has some performance impact. Alternatively, we could use an exception recovery mechanism: when a file import fails, delete the detail records that were already imported. But this introduces a consistency problem in the error handling itself; for example, if two sub-databases were imported successfully and the third failed, we must delete the data from the first two sub-databases, and there is no way to guarantee that those deletes are themselves consistent.
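As a minimal sketch (the table name and columns here are illustrative, not taken from the original), the batch table and the timer's scan could look like this:

CREATE TABLE FILE_BATCH_INFO (
    id        NUMBER PRIMARY KEY,
    file_name VARCHAR(256) NOT NULL,  -- the file to import
    file_path VARCHAR(512) NOT NULL,  -- only the path is stored, not the detail data
    status    CHAR(1) DEFAULT 'I'     -- 'I' = initial, 'S' = success
);

-- The background timer periodically picks up batches still in the initial
-- state and re-executes the whole file import for each of them.
SELECT id, file_name, file_path FROM FILE_BATCH_INFO WHERE status = 'I';
-- after a batch imports into all three sub-databases successfully:
UPDATE FILE_BATCH_INFO SET status = 'S' WHERE id = #id#;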

This scenario can be optimized to the point where it appears to work. First, create a sub-batch table (in the same database as the file batch table). During processing, we split the large file into three sub-files according to the sharding rules, with the data of each sub-file corresponding to one sub-database. This produces three sub-batch records, and because the file batch record and the sub-batch records live in the same database, their creation can be made consistent. Each pending file thus becomes four records: one master file batch record and three sub-batch records, all in the initial state before the data is imported. The import of the file is then decomposed into importing three sub-files into their corresponding databases. For each sub-file, all of its data goes into the same database, so a local transaction guarantees the consistency and integrity of the data within that sub-file; after a successful import, the sub-batch status is updated to success, and once all sub-batches of a file are successful, the corresponding file batch status is updated to success.

This looks perfect and seems to solve the problem. But look closely and there is a small detail: the sub-batch records live in a separate database, so the detail data being imported and the sub-batch record are no longer in the same database, and there is no way to make the two operations consistent. The data in a sub-file may be imported successfully while the update of the sub-batch status fails. We could instead place the sub-batch records in each sub-database, but that takes us right back to the original problem (not spelled out here; it is worth thinking through).
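The detail problem is easy to see when the two operations are written out: they run against different databases, so no single local transaction can span both. A sketch (names are illustrative):

-- Transaction 1, in sub-database SUB1: import the sub-file's detail rows
insert into SUB1.DATA_DETAIL (id, sub_batch_no, payload)
    values (seq.DATA_DETAIL, #sub_batch_no#, #payload#);
commit;

-- Transaction 2, in the batch database: mark the sub-batch as done.
-- If the process dies between the two commits, the data is imported but the
-- sub-batch still looks unprocessed; the pair cannot be made atomic.
update SUB_BATCH_INFO set status = 'S' where sub_batch_no = #sub_batch_no#;
commit;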


Scenario 3

Building on the second scheme, we can optimize further. First, keep the file batch table and the sub-file batch table from the second scenario, both in the same database (assume they are assigned to the main database), so that splitting the task remains consistent. Then, in each sub-database, create a sub-file batch table of its own, with essentially the same model as the sub-file batch table in the main database. When splitting a task, we first guarantee the integrity of the main database's data, i.e., one file batch record and three sub-file batch records. Then each of the three sub-file batch records is copied to the sub-file batch table in its corresponding sub-database, after which the sub-file batch record in the main database is updated to the "synchronized" status. Of course, this copy itself is not guaranteed to be consistent, so the solution starts a scheduled task that periodically synchronizes the initial-state records in the main database's sub-file batch table to the sub-file batch tables of the sub-databases. Two situations can arise here:

1. The sub-database's sub-batch table already contains the same record (this can be guaranteed by a unique primary key), which means the main database's sub-file batch record has already been synchronized; update its status to "synchronized".

2. The record does not exist in the sub-database's sub-batch table; insert it into the corresponding sub-database, then update the main database's sub-file batch record status to "synchronized".

Each sub-database then imports the data in its sub-file and updates its own sub-file batch record status to "processed successfully"; because both operations run on the same database, their consistency is guaranteed by a local transaction. Finally, the status of the corresponding sub-batch record in the main database is updated to "processed successfully". Similarly, if this update of the main database's sub-batch status fails, a similar scheduled mechanism synchronizes the statuses from the sub-databases' sub-file batch tables back to the main database's sub-file batch table. With this retry mechanism, the sub-file batch table in the main database stays consistent with the sub-file batch tables in the sub-databases. When all of a file's sub-file batch records in the main database have been updated to "processed successfully", the file batch status is updated to "processed successfully".

Compared with the second scheme, we add synchronization of data between the main database and the sub-databases, and this retry mechanism ensures the consistency of the sharded data.

Here is a brief look at a simple implementation of the third scenario:

First, the databases, the table structures, and their relationships:

The following walks through the process with simplified scripts.

Assume there are four databases: one main database MAIN, and three sub-databases SUB1, SUB2, SUB3.

The main database has two tables:

FILE_BATCH_INFO, whose key field is the status: I (initial) -> S (success)

SUB_BATCH_INFO, whose key field is the status: I (initial) -> R (synchronized) -> S (processed successfully)

Each sub-database has two tables:

DATA_DETAIL: stores the detail data; this is the main table that business logic operates on

SUB_BATCH_INFO, whose key field is the status: I (initial) -> S (processed successfully)
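A possible set of definitions for these tables (the column details are assumptions for illustration; only the status values come from the description above):

-- main database
CREATE TABLE MAIN.FILE_BATCH_INFO (
    id        NUMBER PRIMARY KEY,
    file_name VARCHAR(256),
    batch_no  VARCHAR(32) UNIQUE,
    status    CHAR(1) DEFAULT 'I'      -- I -> S
);

CREATE TABLE MAIN.SUB_BATCH_INFO (
    id            NUMBER PRIMARY KEY,
    file_name     VARCHAR(256),
    main_batch_no VARCHAR(32),         -- references FILE_BATCH_INFO.batch_no
    sub_batch_no  VARCHAR(32),
    data_base     VARCHAR(16),         -- target sub-database: 'SUB1' / 'SUB2' / 'SUB3'
    status        CHAR(1) DEFAULT 'I'  -- I -> R -> S
);

-- each sub-database (SUB1 shown; SUB2 and SUB3 are identical)
CREATE TABLE SUB1.DATA_DETAIL (
    id           NUMBER PRIMARY KEY,
    sub_batch_no VARCHAR(32),
    payload      VARCHAR(1024)         -- the business detail data
);

CREATE TABLE SUB1.SUB_BATCH_INFO (
    id            NUMBER PRIMARY KEY,  -- same id as in MAIN.SUB_BATCH_INFO, acts as the dedup key
    file_name     VARCHAR(256),
    main_batch_no VARCHAR(32),
    sub_batch_no  VARCHAR(32),
    status        CHAR(1) DEFAULT 'I'  -- I -> S
);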

1. Procedure for splitting a file batch:
begin
declare file_name, batch_no, sub_batch_no;
insert into MAIN.FILE_BATCH_INFO (id, file_name, batch_no, status)
    values (seq.FILE_BATCH_INFO, #file_name#, #batch_no#, 'I');
-- one sub-batch record per sub-database; all four inserts share one local transaction
insert into MAIN.SUB_BATCH_INFO (id, file_name, main_batch_no, sub_batch_no, status)
    values (seq.SUB_BATCH_INFO, #file_name#, #batch_no#, #sub_batch_no#, 'I');
insert into MAIN.SUB_BATCH_INFO (id, file_name, main_batch_no, sub_batch_no, status)
    values (seq.SUB_BATCH_INFO, #file_name#, #batch_no#, #sub_batch_no#, 'I');
insert into MAIN.SUB_BATCH_INFO (id, file_name, main_batch_no, sub_batch_no, status)
    values (seq.SUB_BATCH_INFO, #file_name#, #batch_no#, #sub_batch_no#, 'I');
commit;
end;

 

2. Synchronize the main database's sub-batch records to the corresponding sub-batch table in each sub-database; once synchronization succeeds, update the main database's sub-batch record status to synchronized.

-- Sub-database operation: fetch the matching record from MAIN.SUB_BATCH_INFO and insert it into SUB1
begin transaction in SUB1
declare file_name, batch_no, sub_batch_no;
select SUB_BATCH_INFO.ID into SUB_ID from MAIN.SUB_BATCH_INFO
    where SUB_BATCH_INFO.DATA_BASE = 'SUB1';
-- check whether the record already exists in the sub-database; if it does, return success
if (select * from SUB1.SUB_BATCH_INFO where ID = SUB_ID)
    return SUCCESS;
insert into SUB1.SUB_BATCH_INFO (id, file_name, main_batch_no, sub_batch_no, status)
    values (SUB_ID, #file_name#, #batch_no#, #sub_batch_no#, 'I');
commit;
end;

-- After the SUB1 operation completes, update the matching record in MAIN.SUB_BATCH_INFO
begin transaction in MAIN
declare SUB_ID;
-- 'R' marks the record as synchronized; the current status could also be checked here, but it adds little
update MAIN.SUB_BATCH_INFO set status = 'R' where ID = SUB_ID;
commit;
end;

The above shows the operation for a single sub-database; with multiple sub-databases, the same operation is run in a loop. If a sub-database fails to synchronize, the scheduled recovery mechanism takes over: its SQL extracts the initial-state SUB_BATCH_INFO records from MAIN and repeats the process above.

3. Each sub-database processes its sub-batch: it saves the detail records and then updates the corresponding sub-database SUB_BATCH_INFO record status to processed successfully. The corresponding SUB_BATCH_INFO record in the main database is then also updated to processed successfully.

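A sketch of this step, in the same pseudo-SQL style as the scripts above (the original snippet is not reproduced here, so this is an illustration):

-- In SUB1: save the detail rows and mark the local sub-batch as processed,
-- both inside one local transaction, so they succeed or fail together.
begin transaction in SUB1
insert into SUB1.DATA_DETAIL (id, sub_batch_no, payload)
    values (seq.DATA_DETAIL, #sub_batch_no#, #payload#);  -- repeated for every row of the sub-file
update SUB1.SUB_BATCH_INFO set status = 'S' where sub_batch_no = #sub_batch_no#;
commit;
end;

-- In MAIN: propagate the result. This is the cross-database step that the
-- timer described below must repair when it fails.
begin transaction in MAIN
update MAIN.SUB_BATCH_INFO set status = 'S' where sub_batch_no = #sub_batch_no#;
commit;
end;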

Here too, the sub-database and the main database can fall out of sync, so another scheduled task is needed to synchronize the sub-databases' SUB_BATCH_INFO statuses into the main database's SUB_BATCH_INFO table.

4. Check whether all of the main database's SUB_BATCH_INFO records for the batch have succeeded; if so, update the corresponding FILE_BATCH_INFO record in the main database to success.
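In pseudo-SQL, this final check might look like (a sketch):

begin transaction in MAIN
-- close the file batch only when no sub-batch of it is still unfinished
if not exists (select 1 from MAIN.SUB_BATCH_INFO
               where main_batch_no = #batch_no# and status <> 'S')
    update MAIN.FILE_BATCH_INFO set status = 'S' where batch_no = #batch_no#;
commit;
end;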

Across these four steps, three timers are required: two timers guarantee data consistency between the main database and the sub-databases, and a third is responsible for asynchronously reconciling the main database's file batch with its sub-batches.

The third scenario can be generalized by extracting the common logic to solve similar problems, for example deleting, according to some condition, the matching detail records from every sub-database, or batch-updating the matching detail records in every sub-database. We can abstract these as a task, where a concrete task consists of n subtasks (n being the number of sub-databases), and the system guarantees that the n subtasks either all succeed or all fail, never partially succeed. Following the idea of scenario 3, we can build a master task table and a subtask table; file import is just one task type, and bulk delete, batch update, and similar operations can each be a task type of their own, as sketched below.
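A master task table and subtask table could look like this (a sketch; names and columns are assumptions):

CREATE TABLE MAIN.TASK_INFO (
    id        NUMBER PRIMARY KEY,
    task_type VARCHAR(32),    -- e.g. 'FILE_IMPORT', 'BULK_DELETE', 'BULK_UPDATE'
    params    VARCHAR(1024),  -- serialized parameters of the task
    status    CHAR(1) DEFAULT 'I'
);

CREATE TABLE MAIN.SUB_TASK_INFO (
    id           NUMBER PRIMARY KEY,
    main_task_id NUMBER,      -- references TASK_INFO.id
    data_base    VARCHAR(16), -- the sub-database this subtask runs against
    status       CHAR(1) DEFAULT 'I'
);
-- A task succeeds only when all n subtasks succeed; the same timers as in
-- scenario 3 retry any subtask left in a non-final state.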

Scenario 4

The fourth scenario uses the two-phase commit idea from classic distributed transaction design. Two-phase commit has three important sub-operations: prepare, commit, and rollback.

Continuing with the file import example: each sub-database acts as a transaction participant, and we must design the prepare, commit, and rollback operations for each sub-database.

Prepare phase: each sub-database saves the file detail records to be processed into a temporary table and remembers the context information of this transaction.

Commit phase: move the temporary-table data belonging to this transaction's context into the corresponding formal table.

Rollback phase: delete the temporary-table records related to this transaction.
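As a sketch (the temporary table DATA_DETAIL_TMP and the tx_id context column are illustrative assumptions), the three participant operations on one sub-database could be:

-- Prepare: stage the detail rows under this transaction's id
insert into SUB1.DATA_DETAIL_TMP (id, tx_id, sub_batch_no, payload)
    values (seq.DATA_DETAIL_TMP, #tx_id#, #sub_batch_no#, #payload#);
commit;

-- Commit: move this transaction's staged rows into the formal table
insert into SUB1.DATA_DETAIL (id, sub_batch_no, payload)
    select id, sub_batch_no, payload from SUB1.DATA_DETAIL_TMP where tx_id = #tx_id#;
delete from SUB1.DATA_DETAIL_TMP where tx_id = #tx_id#;
commit;

-- Rollback: discard everything this transaction staged
delete from SUB1.DATA_DETAIL_TMP where tx_id = #tx_id#;
commit;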

By designing a two-phase-commit transaction manager, we can start a distributed transaction when the file import begins and generate a transaction context (stored in the database). When each participant is invoked, the context information is passed down. Each sub-database first executes its prepare operation (saving the details into the temporary table); if that succeeds, the participant has prepared successfully. Once all participants have prepared successfully, the transaction manager commits the transaction, and each participant's commit operation inserts the data from the temporary table into the formal table. If any prepare operation fails, all sub-databases execute the rollback operation and delete the staged records.

The crucial point here is that if a sub-database's prepare phase returns success, its commit must succeed; otherwise the only options are data correction or manual handling. Two-phase commit itself has no solution for this case.

Designing the prepare, commit, and rollback operations for each kind of operation is a relatively large amount of development work, and implementing the distributed transaction manager also requires a certain foundation.

Of the four scenarios above, only the third and fourth guarantee completeness and consistency. In fact, the two schemes share the same design idea: persistent retry plus asynchronous confirmation. Strictly speaking, the third scheme still has problems: throughout the process it can only guarantee eventual consistency, and there is no way to ensure the isolation of ACID, because there are partial commits whose data may later be rolled back. The third scenario could be improved with a locking mechanism, but that extension is considerably more complex.

