Implementation of original subcommit in SQLITE

Source: Internet
Author: User



1. Introduction

An important feature of a database that supports transactions like SQLITE is "Atomic commit ". Atomic commit means that all modification actions in a transaction either occur or none. With Atomic commit, multiple write operations on different parts of a database file will be completed in an instant. Of course, in reality, the storage hardware will serialize write operations, and it will take a short time to write each sector. Therefore, it is absolutely impossible to "complete at the same time. However, the atomic commit logic of SQLITE makes the entire process look like that.

SQLITE ensures that the entire transaction is atomic even if the operating system crashes or powers down during the transaction execution. This article describes the technology used by SQLITE to implement atomic commit.

2. Hardware assumptions

Although flash memory is sometimes used, we will refer to the storage device as a "disk" in the following article ".

We assume that the write operation on the disk is in the unit of "Sector". That is to say, it is impossible to directly modify the disk smaller than one sector, you must read the entire slice into the memory, make the necessary changes, and then write the entire slice back.

For a real "disk", the smallest unit of read/write operations is one slice. However, the smallest unit of read and write operations is generally much smaller than the smallest unit of write operations in flash memory. SQLITE only cares about the minimum write unit. Therefore, when we say "slice" in this article, it refers to the minimum data volume when writing data to the memory.

Prior to version 3.14, SQLITE considered the size of a sector to be 512 bytes under any circumstances, and there was a compilation option that could change this value, but no one has ever tested the code with a larger value. Until recently, it was reasonable to set this value to 512 because all disk drives use 512-byte sectors internally. Recently, however, the size of the disk sector has been increased to 4096 bytes, and the size of the flash sector is generally greater than 512 bytes. For these reasons, since version 3.3.14, the operating system interface layer of SQLITE provides a method for obtaining the real sector size from the file system. However, until now (version 3.5.0), this method still returns a hard-coded 512 byte, because both win32 and unix systems, there is no standard mechanism to obtain the actual value. However, this method provides embedded device providers with the ability to adjust based on actual conditions, and makes it possible for us to provide a more meaningful implementation on win32 and unix in the future.

SQLITE does not assume that write operations on the slice are atomic. It only assumes that the write operation is linear. Linear refers to: When writing a sector, the hardware always starts from one end of the sector, one byte and one byte is written to the end of the other end, and the middle will not go back. The hardware can be written from the beginning to the end, it can also be written from the end to the header. If the power loss occurs only in the middle of the slice, some of the slice may be modified but the other is not modified. A key assumption made by SQLITE here is that as long as the sector is modified, at least one of its first and last bytes will be modified, that is, the hardware will never start from the middle to the two ends. We don't know if this assumption is always correct, but it seems reasonable.

In the previous section, we said, "SQLITE does not assume that the write sector is atomic ". This is correct by default, but in version 3.5.0, we added an interface called "Virtual File System (VFS, it is the only path for communication between SQLITE and the underlying file system. The Code contains the default VFS implementation for unix and windows, and provides a mechanism to create a new VFS implementation at runtime. In this new VFS interface, there is a method called "xDeviceCharacteristics", which queries the file system to determine whether the file system supports certain features. If the file system supports a feature, SQLITE will try to optimize it. The default xDeviceCharacteristics does not indicate that the file system supports atomic write sector operations, so the related optimization is disabled.

SQLITE assumes that the operating system will buffer write operations, and write operations will be returned before the data is actually written to the disk. SQLITE also assumes that write operations will be recorded by the operating system. Therefore, SQLITE will execute "flush" or "fsync" at key points ", it is assumed that "flush" and "fsync" will wait until all ongoing "write operations" are actually completed before returning. In some versions of windows and unix, the "flush" and "fsync" primitives are interrupted. Unfortunately, if a power loss occurs during the submission process on these systems, SQLITE databases may crash, while SQLITE itself cannot. SQLITE assumes that the operating system can be as perfect as advertising. If this is not the case, you have to pray to God not to lose power too often.

SQLITE assumes that when the file grows, the newly added part initially contains junk data and will be overwritten by the actual data. In other words, SQLITE assumes that the file size changes before the file content changes. This is a pessimistic assumption. In order to ensure that the system will not cause database crashes during the period from "File Size Change" to "file content writing, SQLITE does some extra work. XDeviceCharacteristics of VFS may also point out that the file system always writes data first and then updates the file size. In this case, SQLITE can skip some overly careful database protection operations, this reduces the number of disk I/O required for one commit. However, this assumption is not made in the VFS implementation on windows and unix.

SQLITE assumes that file deletion is atomic, at least from the perspective of the user program. That is to say, if SQLITE wants to delete a file and the file loses power during the deletion process, the file cannot be found in the file system after the power is restored, or its content is exactly the same as before the deletion. If the file can still be found in the file system, but the content is modified or cleared, the database is very likely to crash.

SQLITE assumes that detecting bit errors caused by cosmic rays, thermal noise, and driver bugs is the responsibility of the operating system and hardware. SQLITE does not add any redundant information to database files to detect or correct such problems. SQLITE assumes that the data it reads is always exactly the same as the data it wrote last time.

3. Submit a single file

Let's take a look at the steps to ensure the atomicity of transaction commit when SQLITE operates on a separate database file. In order to prevent the file from being damaged when the power is down, the file format should also be taken into consideration during design. Relevant details and multi-database submission technology will be discussed in the subsequent sections.

3.1. Initial status

Displays the status of the computer when the database connection is started. The rightmost side of the figure is the data stored on the disk. Each cell represents a sector, while blue indicates that the sector stores raw data. The middle part of the figure is the cache of the operating system, in the current example, the cache is "cold", so each cell is not colored; the leftmost is the process using SQLITE, so on windows, it would be better to replace some "processes" in the original article with "Threads". I didn't do this replacement, therefore, you need to determine the specific meaning of the "process" based on the context during the reading process.) The database connection has just been created and no data has been read, therefore, there is nothing in the user's memory space.

3.2. Get a "Read lock"

Before writing data to a database, SQLITE must first read data so that it can know what is already in the database. Even if data is simply appended, SQLITE should first read the table structure of the database from the sqlite_master table to know how to parse the INSERT statement and where the new data should be saved to the file.

The first step in read operations is to obtain the "share lock" of a database file ". This shared lock allows two or more databases to connect to read database files at the same time, but other databases are not allowed to connect to write this file. This lock is very important, because if another connection writes data during data reading, we may read a mixture of new data and old data, this will lead to the loss of atomicity for write operations on other connections.

Please note that the shared lock is implemented by the disk cache of the operating system, rather than the disk itself. Generally, file locks are only some of the symbols in the operating system kernel (The details depend on the interface layer of the specific operating system ). Therefore, when the system crashes or powers down, the lock automatically disappears. In addition, the lock automatically disappears after the process that creates the lock exits.

3.3. read data from the database

After obtaining the shared lock, we start to read data from the database file. In this example, we assume that the initial cache is "cold", so we need to first read the data from the disk to the operating system cache, and then copy the data from the cache to the user space. In subsequent read operations, because some or all of the data may already be in the cache, you may only need to copy the data from the cache to the user space.

Generally, we do not need all pages of database files, so we only read a subset of it. In this example, we have 8 pages of database files, and we need 3 of them. A real database may have thousands of pages, but each query usually only has a small part.

3.4. Get a reservation lock

SQLITE needs to obtain a pre-lock before making any changes to the database. The pre-locks are similar to shared locks. They allow other processes to read database files. In addition, the pre-locks can coexist with multiple shared locks. However, a database file can only have one predefined lock at a time, that is, only one process is allowed to write data.

The purpose of the lock is to tell the whole system that a process is about to modify Database files in the near future, but there is no actual action yet. Because it is only an "intention", other processes can continue their own read operations, but they cannot have this intention.

3.5. Create a rollback log (Journal) File

Before any material modification, SQLITE also needs to create an independent log rollback file and write the original content of all the database pages to be replaced to this file. In fact, the log file stores all the information required to restore the database file to its original state.

The log file has a small File Header (green in the figure), which records the original size of the database file. If the database file becomes larger because of the modification, we can still use it to obtain the original file size. The database pages and their corresponding page numbers are put together and written to the log file.

When creating a new file, most operating systems (such as windows, linux, and macOSX) do not write data to the disk immediately. A new file exists only in the operating system cache at the beginning. It will not create this file on the disk until the operating system is idle. This method makes users feel that file creation is very fast, at least much faster than disk I/O. To express this situation, we only drew this log file in the operating system cache.

3.6. Modify the database in the user space

After the original content of the database page is saved to the log file, it can be modified in the user space. Each database connection has a private user space copy. Therefore, these modifications are only visible to the current connection. Other connections still see unmodified content in the operating system cache. In this case, although a process is modifying the database, other processes can continue to read the original content of the database.

3.7. Click "refresh" the log file to the disk.

The next step is to fl the content of the rollback log file to a persistent storage. As you will see later, this is one of the key to making the database survive in the case of power loss. It may take a lot of time, because writing to persistent storage is generally slow.

This step is generally more complicated than simply flushing rollback logs to the disk. On most platforms, You need to flush (or fsync) twice. The first time is the basic content of the log file. Then, modify the header of the log file to reflect the actual number of pages in the log file. Next, refresh the file header for the second time. We will discuss the reasons for modifying the file header and refreshing it once.

3.8. Obtain an exclusive lock

To make real changes to database files, we need an exclusive lock. Two steps are required to obtain the lock. The first step is to obtain a Pending lock and then escalate it to an exclusive lock.

The pending lock allows other processes that already have a shared lock to continue reading database files, but it does not allow creation of new shared locks. It is designed to prevent a large number of read processes from sending the write process to the hunger. There may be dozens or even hundreds of processes in the system that want to read database files. Each of these processes must go through a process of "getting shared locks, reading data, and Releasing locks. If many processes want to read the same database file, it is very likely that the new process always obtains a new shared lock before the existing process releases the shared lock. In this way, there will always be a shared lock on the database files, and the process for writing data may never have the opportunity to get its own exclusive lock. By prohibiting the creation of new shared locks, the pending locks solve this problem. The existing shared locks will be gradually released. Finally, when they are all released, the lock can be upgraded to the exclusive lock.

3.9. Update database files

Once an exclusive lock is obtained, no other process can read the database file, and updating it is safe. In general, updates here will only affect the disk cache layer of the operating system, but will not affect the physical files on the disk.

3.10. fl changes to the memory

In order to write database changes to persistent storage, we need to refresh it again. This is also the key to ensuring that the database will not crash when the power is down. Of course, writing data to the disk or flash memory is too slow. This step, together with the fl log file in section 3.7, will consume most of the time that SQLITE commits a transaction.

3.11. Delete log files

After securely writing all changes to the memory, you can delete the rollback log file. This is the time point at which the transaction is committed. If the power is down or the system crashes before this happens, the subsequent recovery process will bring the database file back to the state before modification, as if nothing had happened. If the power is down or the system crashes after the log file is deleted, all modifications will take effect. Therefore, whether SQLITE modifies the database is all valid or all invalid depends on whether the log file exists.

Deleting a file is not necessarily an atomic operation, but from the perspective of a user program, it seems to be always atomic. The process can always ask the Operating System "does this file exist ?" And wait until the answer is yes or no. If a power loss occurs during the transaction commit process, SQLITE will ask whether the operating system has a rollback log file. If yes, the transaction is incomplete and needs to be rolled back, if the transaction does not exist, the transaction is successfully committed.

The implementation of SQLITE transactions depends on the existence of rollback log files and the deletion of atomic files in the eyes of the user program. Therefore, a transaction is also an atomic operation.

3.12. Release the lock

The last step is to release the exclusive lock so that other processes can access the database files again.

In, we can see that the data in the user space is cleared after the lock is released. If it is an earlier version of SQLITE, this is the actual situation. But since the last few versions, SQLITE does not do this, because they may be used in the next operation. The performance of reusing data that is already in the local memory is much higher than reading data from the operating system cache or disk. Before using them again, we need to get a shared lock first, and then check whether other processes have modified the database files during the period when we have no lock. There is a counter on the first page of the database, which is incremented every time the database is modified. Check this counter to see if the database has been modified by other processes. If you have modified the data, you must clear the data in the user space and read the new data. However, it is more likely that there is no modification, so that the original data can be reused, thus greatly improving the efficiency.

4. rollback

Atomic commit seems to be completed in an instant, but it is obvious that the process described above takes some time to complete. If the power supply is cut off during the commit process, in order to make the entire process look instantaneous, We must roll back those incomplete changes and restore the database to the State before the start of the transaction.

4.1. If something goes wrong...

Assume that the power loss occurs in the step described in section 3.10, that is, when the database changes are flushed to the disk. After the power is restored, the situation may be as shown in. We want to modify three pages of data, but only one page is successfully completed. One page is only written in part, and the other page is not written at all.

The log file after power recovery is complete, which is the key. The operation in section 3.7 is to ensure that all the log Content is securely written to the persistent storage before any changes are made to the data file.

4.2. "Hot" rollback log

Before any process accesses a database file for the first time, it must obtain a shared lock described in section 3.2. Then, if a log file is found, SQLITE will check whether the rollback log is "hot ". We must replay the hot log file to restore the database to a consistent state. Hot log files are generated only when a program loses power or crashes while committing a transaction.

Log files are hot when all of the following conditions are met:
● Log files exist.
● Log files are not empty files
● There is no pre-lock on the database file
● The Log File Header does not contain the name of the main log file, or, if there is a main log file name, the main log file exists.

The hot log file tells us that a process tried to commit a transaction, but for some reason, the commit was not completed. That is to say, the database is in an inconsistent state and must be repaired (rolled back) before use ).

4.3. Obtain the exclusive lock on the database.

The first step in processing hot logs is to obtain the exclusive lock on the database files, which can prevent two or more processes from playing back one hot log at the same time.

4.4. Incomplete rollback modifications

After obtaining the exclusive lock, the process has the right to modify the database file. It reads the original content of the page from the log, and then writes them back to their original location in the database file. As mentioned above, the header of the log file records the size of the database file before the start of the transaction. If the modification increases the size of the database file, SQLITE will use this information to cut the file to the original size. After this step, the database file should be as big as before the start of the transaction, and contain the same data as at that time.

4.5. Delete hot log files

All the information in the log is replayed to the database file, and the database file is flushed to the disk (it may be power-down again during rollback), you can delete the hot log file.

4.6. Keep moving forward, just as the interrupted transaction has never happened.

The last step of rollback is to downgrade an exclusive lock to a shared lock. Since then, the database status looks like the interrupted transaction has never started. Since the entire rollback process is completely automatic and transparent, the program using SQLITE won't even know that there is a transaction interruption and rollback.

5. Submit multiple files

Using the ATTACHDATABASE command, SQLITE allows a single database to connect to multiple database files. When multiple files are modified in a transaction, all files are updated by atoms. In other words, or all files will be updated, or one will not be updated. It is more complex to implement atomic commit on multiple files than to implement it on a single file. This chapter will explain how SQLITE achieves this.

5.1. One log per Database

When a transaction involves multiple database files, each database has its own rollback log, and their locks are also independent. It shows how the three database files are modified in one transaction. The State described by the three database files is equivalent to the state of a single file transaction in section 3.6. Each database file has its own predefined lock. The original content of the pages to be modified has been written into the rollback log, but it has not been flushed to the disk. The data in the user memory has been modified, but the database file has not changed.
Compared with the previous one, it is simplified. In this figure, Blue still represents raw data, and pink still represents new data. However, the rollback log and database pages are not shown above, and the data in the operating system cache and on the disk are not clearly distinguished. All of this applies to this graph, but even if we draw them out, we cannot learn anything new. Therefore, we omit them to narrow down the graph.

5.2. Main Log File

The next step in Multi-file submission is to create a "master log file ". The name of this file is the name of the initial database file (that is, the database opened with sqlite3_open (), rather than the one appended later) with the suffix "-mjHHHHHHHH ". HHHHHHHH is a 32-bit hexadecimal random number. It is different every time a new primary log file is generated.

(Note: The method used in the above section to generate the name of the main log file is used in version 3.5.0. This method is not standardized, nor is it part of the external interface of SQLITE. In future versions, we may modify it .)

The master log does not contain information related to the raw database page. It stores the complete paths of all log files involved in the transaction rollback.

After the master log is generated, it is immediately flushed to the disk without any other operations. On unix systems, the directory where the main log is located will also be synchronized to ensure that it will also appear in this directory after power loss.

5.3. Update the rollback Log File Header

The next step is to record the path of the master log to the file header of the rollback log. When the rollback log is created, the corresponding space is reserved in the file header.

Before and after the master log path is written to the rollback Log File Header, You need to refresh the rollback log content to the disk. This may cause some efficiency loss, but it is very important. Fortunately, only one page (the first page) of data changes for the second time, therefore, the entire operation may not be as slow as you think.

This operation is roughly equivalent to Step 1 when a single file is submitted, that is, the content in section 7th.

5.4. Update database files

After flushing the rollback log to the disk, You can securely update the database files. We need to obtain the exclusive lock on all database files, write data, and fl the data to the disk. This step is equivalent to steps 8th, 9, and 10 when a single file is submitted.

5.5. Delete the main Log File

The next step is to delete the master log file, which is the time point when the multi-file transaction is actually committed. It is equivalent to Step 1 when a single file is submitted, that is, the step for deleting the log file.
If the power is down or the system crashes after this occurs, the transaction will not be rolled back even if the log file is rolled back. The difference here is that the file header of the rollback log contains the path of the master log. SQLITE only considers the rollback log (when a single file is submitted) without the path of the main log file in the file header or the rollback log still exists in the main log file as "hot ", and only the hot rollback logs will be played back.

5.6. Clear rollback log files

The last step is to delete all rollback log files and release the exclusive lock so that other processes can detect data changes. This step corresponds to Step 1 when a single file is submitted.

As the transaction has been committed, it is not very urgent to delete these files. The current implementation is to delete a log file, release the exclusive lock on its corresponding database file, and then process the next one. In the future, we may change it to deleting all log files first, and then releasing the exclusive lock. Here, as long as the log file is deleted before it is released, the corresponding locks are released. The order in which the file is deleted or the order in which the lock is released is not important.

6. More details in the submission

Chapter 2 introduces the implementation of SQLITE atomic commit in general, but misses several important details. This chapter will provide some additional instructions for them.

6.1. The entire slice is always recorded in the log

When you write the original content of the database page into the rollback log, SQLITE writes the complete slice even if the page is smaller than the slice. In the past, the sector size in SQLITE was hard-coded 512 bytes, and the minimum page size was 512 bytes, so there was no problem. However, since version 3.3.14, SQLITE also supports storage with a slice size of more than 512 bytes. Therefore, from this version, when any page in a slice is written into the log, other pages in this sector will also be written together.

Power loss may occur when writing a sector, and the entire sector is always recorded. In this case, the database is not damaged. For example, assume that each slice has four pages, and now the page 2 has been modified. To write the changes to this page, the underlying hardware can only write the complete slice, page 1, page 3, and page 4 will be rewritten. If the write operation is interrupted, the data on these three pages may be incorrect. To avoid this problem, you must write all the pages in the slice into the rollback log.

6.2. Junk data in the log file

When appending data to the end of a log file, SQLITE generally assumes that the file system will use junk data to increase the file size and overwrite the garbage with the correct data. In other words, SQLITE assumes that the file size increases first, and then the actual content is written. If the power loss occurs when the file has become larger but the data is not written, the rollback log contains junk data. After the power is restored, another SQLITE process will find the log file and attempt to restore it. This may copy the junk data to the database file and cause damage to it.

To solve this problem, SQLITE has established two lines of defense. First, SQLITE records the actual number of pages in the rollback log file header. This number is 0 at the beginning. Therefore, when you play back an incomplete rollback log, SQLITE will find that the file does not contain any pages and will not make any changes to the database. Before submission, rollback logs are flushed to the disk to ensure that there is no garbage in the logs. Then, the number of pages in the file header is changed to the actual value. The file header is always stored in a separate sector. Therefore, if a power loss occurs when it is overwritten or flushed to the disk, other pages will not be damaged. Note that the rollback log should be flushed twice to the disk: the first is the original content of the page, and the second is the number of pages in the file header.

The above section describes the situation when the synchronization option is set to "full" (PRAGMAsynchronous = FULL), which is also the default setting. However, when the synchronization option is lower than "normal", SQLITE will only fl the log file once, that is, the time after the number of pages is modified. Because the number of pages (greater than 0) may arrive on the disk before other data, this poses a certain risk. SQLITE assumes that the file system records write requests, so even if you write data first and then write the number of pages, the number of pages may be first recorded by the disk. Therefore, as the second line of defense, SQLITE records a 32-bit verification code for each page of data in the log file. When you roll back a log file, SQLITE will check this verification code. Once an error is found, the rollback operation will be abandoned. It should be noted that the verification code cannot completely guarantee the correctness of the page data. The probability that the verification code is correct is extremely small, but not zero .. However, the verification code mechanism should at least make similar things seem less likely to happen.

When the synchronization option is set to "full", there is no need to use the verification code. We only need it when the synchronization option is lower than "normal. However, because verification codes are harmless, they always appear in the rollback log regardless of the synchronization options.

6.3. cache overflow before submission

The procedure described in Chapter 3 assumes that all database changes before submission can be stored in the memory. This is generally the case, but it will also happen in special cases. At this time, the database changes will run out of the user cache before the transaction is committed. You need to write the content in the cache to the database in advance.

Before the operation, the database connection is in the status of step 3.6: the content of the original page has been saved to the rollback log, and the modified page is located in the user memory. To recycle the cache, SQLITE executes steps 3.7 to 3.9, that is, flushing the rollback log to the disk, obtaining the exclusive lock, and then writing the changes to the database. However, the subsequent steps are different before the transaction is actually committed. SQLITE will append a file header (using a separate sector) at the end of the log file, and the exclusive lock will continue to be retained, and the execution process will jump to step 3.6. When the transaction commits or recycles the cache again, steps 3.7 and 3.9 will be repeated (because the cache is obtained with an exclusive lock for the first time and has not been released, step 3.8 will be skipped ).

Increasing the reserved lock to an exclusive lock will reduce the concurrency, and the extra disk flushing operation will be very slow. Therefore, the recovery of cache will seriously affect the system efficiency. Therefore, SQLITE will not use it whenever possible.

7. Optimization

The performance analysis of the program shows that SQLITE consumes most of the time on disk I/O in most systems and most cases. Therefore, reducing the number of disk I/O is the most likely way to greatly improve efficiency. This chapter describes some techniques used by SQLITE to reduce disk I/O while ensuring atomic commit.

7.1. Keep the cached data between transactions

In section 3.12, we said that when the shared lock is released, all database information already in the user cache will be discarded. This is because when there is no shared lock, other processes can modify the content of database files at will, resulting in the cache data being out of date. Therefore, whenever a new transaction starts, SQLITE must read the previous one again. This operation is not as bad as you think, because the data to be re-read is likely to be still in the operating system cache, the so-called "re-reading" generally only copies data from the kernel space to the user space. However, even so, it may take some time.

From version 3.3.14, we have added a mechanism in SQLITE to avoid unnecessary rereading. In these versions, after the shared lock is released, the pages cached by the user are retained. After SQLITE starts the next transaction and obtains the shared lock, it checks whether other processes have modified database files. If it has been modified since the last lock release, the user cache will be cleared and re-read. However, there is usually no modification, so the user cache is still valid, so many unnecessary read operations are avoided.

To determine whether the database file is modified, SQLITE uses a counter in the file header (24th to 27 bytes), and each modification operation increments it. Before releasing the database lock, SQLITE will write down the value of this counter. After the lock is obtained again, It will compare the recorded value with the actual value. If it is the same, the existing cache data will be reused, clear the cache and re-read it.

7.2. Exclusive Access Mode

Since version 3.3.14, SQLITE has added the "exclusive access mode ". In this mode, SQLITE retains the exclusive lock after the transaction is committed. In this way, other processes cannot access the database. However, since most deployment solutions only have one process to access the database, there is usually no problem. The exclusive access mode makes it possible for the following three methods to reduce disk I/O:
1) In addition to the first transaction, you do not have to increment the counters in the database file header each time. This usually means that the page 1 is skipped in the database files and rollback logs.
2) because no other process can access the database, it is not necessary to check the counter and clear the user cache each time a transaction is started.
3) after the transaction ends, you can roll back the log file by setting the file length to 0 bytes rather than deleting it. In many operating systems, truncation is much faster than deletion.

The third optimization, that is, replacing the deletion with truncation, does not require an exclusive lock. Theoretically, it is possible to always implement it, rather than simply implement it in the exclusive access mode. Maybe we will make it a reality in future versions. However, so far (version 3.5.0), this optimization is only effective in the exclusive access mode.

7.3. Do not record idle pages

When deleting data from a database, pages that are no longer used will be added to the "Idle page table. The subsequent insert operations will first use these pages, instead of expanding database files. Some idle pages also contain important data, such as the location of other idle pages. However, most idle pages are useless. We call these pages "leaf pages ". Modifying the leaf page does not affect the database.

Since the leaf pages are useless, SQLITE does not record them in the rollback log in step 3.5 of the submission process. That is to say, it is harmless to modify the leaf pages but not recover them during rollback. Similarly, the content of a new leaf page is neither written to the database nor read out in step 3.9. When the database file has free space, this optimization greatly reduces the number of disk I/O.

7.4. Single-page update and atomic sector write

Starting from version 3.5.0, the new VFS Interface contains a method named xDeviceCharacteristics, which can report whether the underlying storage supports some features. One of these features is "Atomic sector write ".

As we mentioned earlier, SQLITE assumes that the write sector is linear, not atomic. Linear write starts from one end of the slice and ends with one byte. If power loss occurs in the middle of linear writing, one end of the slice may be modified, but the other end remains unchanged. However, in the case of atomic writing, the sector is either completely updated or completely unchanged.

We believe that most disk drives now implement atomic sector writing. When the power is down, the drive uses the electric energy in the capacitor and the kinetic energy of (or) disk rotation to complete the ongoing operation. However, there are too many layers between system write calls and disk electronic components, so we make a conservative assumption on the default VFS Implementation of Unix and windows, and think that the write sector is not atomic. On the other hand, device vendors who have more say in the file systems they use may choose to enable this option in xDeviceCharacteristics if their hardware does support atomic sector writing.

When the write sector is atomic, the database page is as large as the fan area, and the database changes only involve one page, SQLITE skips the entire logging and synchronization process, directly write the modified page to the database file. The modification counter on the first page of the database file is also modified independently, because it is harmless even if the power is down before it is updated.
In my opinion, if the hardware does not support atomic sector writing, it cannot implement absolute atomic commit at the software level.

7.5. Support for secure append file systems

Another optimization measure added to version 3.5.0 is based on the "secure APPEND" function of the file system. SQLITE assumes that when you append data to a file (especially a rollback log file), the file size changes before the file content increases. Therefore, if the power is down and the file becomes larger, the file contains junk data before the data is written. You can also use the xDeviceCharacteristics method in VFS to point out that the file system supports the "secure APPEND" function, which means that the increase of content is earlier than the size change. Therefore, it is impossible to introduce garbage into log files due to power failure or system crash.

When the file system supports Secure append, SQLITE always fills in-1 in the digital segment of the log file header, indicating that the number of pages to be processed during rollback should be automatically calculated based on the size of the log file. This-1 will not be modified, so we do not need to refresh the first page of the log file separately when submitting the log file. In addition, when cache is recycled, there is no need to write a new file header at the end of the log file. We only need to append a new page to the existing log file.

8. test atomic commit

As a developer of SQLITE, we are confident in the robustness of power loss and system crash, because our automatic test process is under a simulated power failure, the restoration capability is measured in many ways. We call this simulated fault a crash test ".

The crash test uses a modified VFS to simulate various file system errors that may occur during power failure or crash. It can simulate a sector with no complete write operations, a page containing junk data due to incomplete write operations, and write operations with wrong sequence, these errors may occur at various path points in the test scenario. The crash test keeps executing transactions, causing simulated power loss or system crashes to occur at different times, resulting in various data damages. After a simulated crash event occurs, the test program re-opens the database and checks whether the transaction is completely completed or (seemingly) not started at all, that is, whether the database is in a consistent state.

The SQLITE crash test helped us find many minor problems in the recovery mechanism (which have been fixed now ). Some of them are very obscure and may not be found through code check and analysis alone. These experiences convince SQLITE developers that database systems that do not use crash tests are very likely to include bugs that cause database damage when the system crashes or power is down.

9. Possible Problems

Although SQLITE's atomic commit mechanism is robust, it may be overwhelmed by malicious opponents or imperfect operating systems. This chapter describes several situations that may cause database damage when the power is down or the system crashes.

9.1. Problematic lock

SQLITE uses the file system lock to ensure that only one process and database connection can modify the database at a certain time point. The file system lock mechanism is implemented at the VFS layer and varies with each operating system. The correctness of SQLITE depends on the correctness of this implementation. If two or more processes can modify a database file at the same time, the database will be seriously damaged.

Someone reported to us that windows's Network File System and (Unix,) NFS locks have some problems. We cannot verify these reports, but we cannot deny them considering the difficulty of implementing a correct lock on the Network File System. Because network file systems are inefficient, we recommend that you avoid using SQLITE on them. If you must do this, consider using an additional locking mechanism to ensure that even if the file system's own locking mechanism does not work, it does not cause multiple processes to write one database file at the same time.

SQLITE pre-installed on Mac OSX allows you to use an alternative locking policy on all network file systems supported by Apple. This extension works well as long as all processes Access database files in a unified manner. Unfortunately, these locking mechanisms are independent of each other. If one process uses the AFP lock and the other uses the dot-file lock, the two processes may conflict, because the AFP lock cannot prohibit the dot file lock, and vice versa.

9.2. Incomplete disk flushing operations

As you can see in Sections 3.7 and 3.10, SQLITE needs to cache the system to the disk. On unix systems, fsync () is used for calling, while FlushFileBuffers () is used for windows (). However, the reports we received show that these interfaces on many systems are not as good as advertising. We have heard that in some windows versions, you can completely disable FlushFileBuffers () by modifying the Registry. In some historical linux versions, fsync is nothing but an empty operation. We also know that, even on systems where FlushFileBuffers () or fsync () can work normally, the IDE disk controller often keeps data in its own cache, lie that the data has reached the disk surface.

ON Apple's system, If you enable the fullsync option (PRAGMAfullsync = ON), it can ensure that the data is indeed flushed to the disk. Fullsync itself is slow, and the implementation of fullsync also needs to reset the disk controller, which will slow down other non-related disk I/O, so we do not recommend that you do this.

9.3. Only half of files are deleted.

SQLITE assumes that file deletion is an atomic operation from the perspective of a user program. If the file is deleted and power is down, SQLITE expects that the file does not exist or is a complete file that is exactly the same as before. If the operating system cannot do this, the transaction may not be atomic.

9.4. Spam in the file

SQLITE database files are common files. Other user programs can open them and write data to them. Some rogue programs may do this. Spam data may also come from operating system or disk controller bugs, especially those that will be triggered when power is down. SQLITE is powerless to address such problems.

9.5. delete or rename a hot Log File

If a power failure or crash occurs and a hot log file is generated, the names of the two files cannot be changed until another SQLITE process opens it and the database file and rolls it back. In Step 4.2, SQLITE searches for a hot log file in the directory where the opened database file is located. The file name is derived from the database file name. Therefore, as long as any one of these two files is removed or renamed, hot logs will not be found and will not be rolled back.

We think that the failure mode of SQLITE recovery is generally like this: power loss occurs; after power recovery, a kind of user or system administrator starts to check the loss; they found that there was a name named "important. data files, they may be very familiar with this file, so they did not perform any operations; but after the crash, there is a disk named "important. data-journal "Hot log file, the user deleted it because they think this file is garbage in the system. The only way to prevent such incidents may be to enhance user education.

If multiple links (hard links or symbolic links) direct to a database file, the generated log file is named based on the Link name used to open the database file. If a crash occurs and another link is used when the database is opened next time, the rollback will not be performed because the hot log file cannot be found.

In some cases, power loss may cause file system errors and the new file name cannot be recorded. In this case, the file will be moved to the "/lost + found" directory. To prevent such errors, SQLITE will open and synchronize the directory where the file is located while synchronizing the log file. However, the operations of creating other files in the directory where the database files are located may also cause the files to be moved to "/lost + found, this is beyond the control of SQLITE, so SQLITE has no way to do it. If you are using a file system with such namespace that is vulnerable to corruption (we believe most modern log file systems do not have this problem ), we recommend that you put the SQLITE database file in a separate subdirectory.

10. Summary and prospects

No matter in the past or the present, there are always some failures in the SQLITE atomic commit mechanism, and developers have to make some patches for this. However, this kind of thing has been happening fewer and fewer, and the failure mode has become increasingly obscure. However, it would be silly to think That SQLITE's atomic commit logic is already impeccable. What developers can promise is to fix new bugs as quickly as possible.

At the same time, we are looking for new methods to optimize this commit mechanism. On Linux, MacOSX, and windows, the current VFS implementation has made pessimistic assumptions. Maybe, after talking to some experts familiar with the operating principles of these systems, we can relax some restrictions to make them run faster. In particular, we guess most modern file systems already have the "secure APPEND" and "Atomic sector write" features, but we will still keep the worst assumptions conservative before confirmation.

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.