Redis persistence mechanism



"Rdb vs aof two persistence modes, implementation principle" "RDB mode" fork a process, traverse hash table, and use copy on write to save the entire db dump. Save, shutdown, slave command will trigger this operation. Particle size ratio is large, if save, shutdown, slave before crash, then the middle of the operation has no way to recover. The "aof mode" writes the instructions to a log file in a similar way. (similar to exporting SQL from a database such as PostgreSQL, write-only) granularity is small, and after crash, only crash did not have time to log before the operation can not be restored. AOF mode: Continuous log record write operation, after crash with log recovery; RDB mode: Normally write operation does not trigger the write, only manually submit the Save command, or close the command, the backup operation is triggered. The standard of choice is to see if the system is willing to sacrifice some performance in exchange for higher cache consistency (AOF), or is willing to write frequently, do not enable backup in exchange for higher performance, when manually run save, then do Backup (RDB).  RDB: Final consistency. "Decrypt Redis Persistence" "http://blog.nosqlfan.com/html/3813.html" "Write operation Flow" First we look at what the database did in the write operation, mainly the following five processes.
    1. The client sends a write operation to the server (the data is in the client's memory)
    2. The database server receives the data for the write request (the data is in the server's memory)
    3. The server calls write (2), which writes the data to the disk (the data is in the buffer of the system memory)
    4. The operating system transfers data from the buffer to the disk controller (data is in the disk cache)
    5. The disk controller writes data to the physical media on the disk (the data actually falls on the disk)
The "Fault Analysis" write operation is roughly the above 5 processes, below we combine the above 5 processes to look at various levels of failure.
    1. When the database system fails, this time the system kernel is OK, then as long as we finish the 3rd step, then the data is safe, because the subsequent operating system will be to complete the next few steps to ensure that the data will eventually fall on the disk.
    2. When the system loses power, all of the caches mentioned in 5 above are invalidated and both the database and the operating system stop working. Therefore, only when the data in the completion of the 5th step, the machine power loss to ensure that the data is not lost, in the above four steps of the data will be lost.
From the above 5 steps, it is worth clarifying the following questions:
    1. How often does the database call write(2) to move data into the kernel buffer?
    2. How often does the kernel flush data from the system buffers to the disk controller?
    3. When does the disk controller write the data in its cache to the physical media?
For the first question, the database layer usually has full control. For the second, the operating system has its own default policy, but we can also force it to push data from the kernel buffers to the disk controller with the fsync family of calls provided by the POSIX API. For the third, the database seemingly has no say, but in practice the disk cache is usually disabled, or enabled for reads only, which means writes are not cached and go straight to the platters; the recommended practice is to enable the write cache only if your disk device has a battery-backed cache.

Data corruption

So-called data corruption is data that cannot be recovered. Everything above is about making sure data really gets written to disk, but data that has been written to disk can still end up corrupted. For example, one request may require two separate writes, and a failure in between may leave one write safely completed and the other not. If the database's data files are not organized carefully, this can leave the data completely unrecoverable. There are roughly three strategies for organizing data so that the files cannot be corrupted beyond recovery:

The first is the coarsest: do not organize the data for recoverability at all, and instead configure a synchronous backup; if a data file is damaged, restore it from the backup. This is effectively the situation of MongoDB with journaling disabled and a replica set configured.

The second is to add an operation log that records every operation performed, so that the log can be used for data recovery. Because the operation log is written in sequential append mode, the log itself can never end up unrecoverable. This is similar to MongoDB with journaling enabled.

The safest approach is for the database never to modify old data and to write only by appending, so that the data itself is a log and can never become unrecoverable. CouchDB is a good example of this approach.
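
Returning to the write path above, here is a minimal sketch (plain Python, not Redis code; the file name is illustrative) of the difference between step 3 and the fsync control point: write(2) only puts the data into the kernel buffer, while fsync asks the kernel to push it on towards the disk.

import os

fd = os.open("oplog.bin", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
os.write(fd, b"SET key1 Hello\n")   # step 3: the data now sits in kernel buffers
os.fsync(fd)                        # force steps 4-5 to be initiated right away
os.close(fd)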

RDB Snapshot

The first persistence strategy is the RDB snapshot. Redis supports persisting a snapshot of the current data into a data file. But how does a database that is being written to continuously produce a snapshot? Redis uses the copy-on-write mechanism of fork: when a snapshot is generated, the current process forks a child process, which then loops over all the data and writes it out to an RDB file.



We can configure when RDB snapshots are generated through the save directive. For example, you can configure a snapshot to be generated when there have been 100 writes within 10 minutes, or 1000 writes within 1 hour, and several rules can be combined. These rules are defined in the Redis configuration file, and you can also set them while Redis is running with the CONFIG SET command, without restarting Redis, as in the sketch below.
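
A minimal sketch, assuming a local Redis on the default port and the third-party redis-py client; the rule values are illustrative, not recommendations.

import redis

r = redis.Redis(host="localhost", port=6379)

# "600 100 3600 1000" means: snapshot if 100 keys changed within 600 seconds,
# or 1000 keys changed within 3600 seconds.
r.config_set("save", "600 100 3600 1000")
print(r.config_get("save"))   # check the rules currently in effect

# An explicit background snapshot can also be requested at any time.
r.bgsave()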



A Redis RDB file never ends up half-written, because the write is performed by a separate process: when a new RDB file is generated, the child process forked by Redis writes the data to a temporary file first, and then renames the temporary file to the RDB file with the atomic rename system call. So no matter when a failure occurs, the RDB file on disk is always a usable one. The same write-to-temp-then-rename pattern is sketched below.
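
A minimal sketch of the write-to-a-temp-file-then-atomic-rename pattern described above (plain Python, not Redis code; the file names and payload are illustrative).

import os

def safe_dump(path, payload):
    tmp_path = path + ".tmp"           # hypothetical temporary name
    with open(tmp_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())           # make sure the bytes reach the disk first
    os.replace(tmp_path, path)         # atomic rename: readers see either the
                                       # old file or the complete new one

safe_dump("dump.rdb", b"snapshot bytes")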



At the same time, Redis's RDB file is also part of the implementation of Redis master-slave synchronization.



However, the RDB clearly has its shortcomings: once the database has a problem, the data saved in the RDB file is not up to date; everything written between the generation of the last RDB file and the moment Redis went down is lost. For some businesses this is tolerable, and we recommend they persist with RDB, because enabling RDB does not cost much. But for applications with very high data-safety requirements that cannot tolerate any data loss, RDB is not enough, so Redis introduced another important persistence mechanism: the AOF log.



AOF Log


The full name of the AOF log is append only file; as the name suggests, it is an append-only log file. Unlike the binlog of a typical database, an AOF file is plain, recognizable text, and its contents are standard Redis commands. For example, let's run the following experiment with Redis 2.6, enabling AOF through a startup parameter:


./redis-server --appendonly yes


Then we execute the following commands:


redis 127.0.0.1:6379> set key1 Hello
OK
redis 127.0.0.1:6379> append key1 " World!"
(integer) 12
redis 127.0.0.1:6379> del key1
(integer) 1
redis 127.0.0.1:6379> del non_existing_key
(integer) 0


When we view the AoF log file, we will get the following content:


$ cat appendonly.aof
*2
$6
SELECT
$1
0
*3
$3
set
$4
key1
$5
Hello
*3
$6
append
$4
key1
$7
 World!
*2
$3
del
$4
key1


As you can see, each write operation produces a corresponding command in the log. Note the last DEL command: it was not recorded in the AOF, because Redis determined that it made no change to the data set, so there is no need to record a useless write. In addition, the AOF log is not always generated verbatim from the client's request. For example, the INCRBYFLOAT command is recorded in the AOF as a SET, because floating-point arithmetic may differ between systems; to prevent the same log from producing different data sets on different systems, only the result of the operation is recorded, via SET.
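
A small illustration of that point, assuming a local Redis and the redis-py client; the key name and numbers are made up.

import redis

r = redis.Redis(host="localhost", port=6379)
r.set("price", "10.50")
r.incrbyfloat("price", 0.25)   # expected to appear in the AOF as: SET price 10.75
print(r.get("price"))          # b'10.75'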


AOF rewrite


You might wonder: if every write command produces a log entry, won't the AOF file become very large? The answer is yes, the AOF file keeps growing, so Redis provides a feature called AOF rewrite. Its job is to regenerate the AOF file so that each record appears in the new file only once, unlike the old file, which may log multiple operations on the same value. The rewrite process is similar to an RDB dump: Redis also forks a process, traverses the data directly, and writes it to a new temporary AOF file. While the new file is being written, all incoming write commands are still appended to the old AOF file and are also recorded in an in-memory buffer. When the traversal completes, the buffered logs are written to the temporary file in one go, and the atomic rename call then replaces the old AOF file with the new one.
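
A minimal sketch, assuming redis-py, of triggering a rewrite by hand and of the configuration knobs for automatic rewrites; the values shown are illustrative.

import redis

r = redis.Redis(host="localhost", port=6379)

# Ask Redis to rewrite the AOF in the background right now.
r.bgrewriteaof()

# Or let Redis rewrite automatically: here, when the AOF has grown by 100%
# since the last rewrite and is at least 64 MB.
r.config_set("auto-aof-rewrite-percentage", "100")
r.config_set("auto-aof-rewrite-min-size", str(64 * 1024 * 1024))   # bytes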



From the above processes we can see that both the RDB dump and the AOF rewrite are sequential I/O operations, so their performance is high. Likewise, when the database is restored from an RDB file or an AOF log, the data is read sequentially and loaded into memory, so no random disk reads are caused.


AOF Reliability Settings


AOF is a file-write operation whose purpose is to put the operation log on disk, so it goes through the same 5 steps described above for a write operation. How safe is an AOF write, then? That is in fact configurable: Redis writes the AOF with write(2), and when fsync is called to push it to disk is controlled by the appendfsync option. The three appendfsync settings below are listed in order of increasing safety.


appendfsync no


With appendfsync set to no, Redis does not actively call fsync to push the AOF contents to disk, so flushing depends entirely on the operating system's own policy. On most Linux systems this means the buffered data is written to disk roughly every 30 seconds.


appendfsync everysec


With appendfsync set to everysec, Redis by default calls fsync once per second to push the buffered data to disk. However, if an fsync call takes longer than 1 second, Redis adopts a delayed-fsync policy and waits another second; that is, it fsyncs 2 seconds after the previous one, and this time the fsync is carried out no matter how long it takes. Because the file descriptor is blocked during the fsync, the current write operation blocks as well.



So the conclusion is that in the vast majority of cases Redis fsyncs every second, and in the worst case an fsync is performed once every two seconds.



This kind of operation is called group commit in most database systems: multiple write operations are combined and their logs are written to disk in a single pass.


appendfsync always


With appendfsync set to always, every write operation calls fsync. The data is safest this way, but of course performance suffers because fsync is executed on every write. All three settings can be switched at runtime, as sketched below.
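
A minimal sketch, assuming redis-py, of switching the fsync policy at runtime; everysec is the usual compromise between safety and speed.

import redis

r = redis.Redis(host="localhost", port=6379)
r.config_set("appendfsync", "everysec")   # or "no" / "always"
print(r.config_get("appendfsync"))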



What's the difference for pipelining?


With pipelining, the client sends n commands at once and then waits for the results of all n commands to come back together. Adopting pipelining means giving up reading the reply of each command immediately, because the n commands are executed within the same batch. So with appendfsync set to everysec there may be some deviation, since those n commands may take more than 1 second, or even 2 seconds, to execute. However, it is guaranteed that the delay will never exceed the execution time of the n commands themselves. A pipelining sketch follows.
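
A minimal sketch of pipelining with the redis-py client (assumed); commands are buffered on the client, sent together, and their replies come back together.

import redis

r = redis.Redis(host="localhost", port=6379)
pipe = r.pipeline(transaction=False)   # plain pipelining, no MULTI/EXEC
pipe.set("k1", "v1")
pipe.append("k1", "-more")
pipe.delete("k2")
print(pipe.execute())                  # e.g. [True, 7, 0]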


Comparison with PostgreSQL and MySQL


There is not much to add here, because the operating-system-level data-safety issues discussed above apply to all of these databases in the same way. In short, the conclusion is that with AOF enabled, the single-node data safety of Redis is no weaker than that of these mature SQL databases.


Data import


What are these persisted files for? Data recovery after a restart, of course. Redis is an in-memory database; whether it is the RDB or the AOF, both are just means of recovering its data, so when restarting Redis reads the RDB or AOF file and reloads it into memory. Compared with MySQL and similar databases, its startup time is therefore much longer, because MySQL does not need to load its data into memory at startup.



On the other hand, after MySQL starts serving requests, the hot data it accesses is loaded into memory only gradually; this is commonly called warming up, and performance is not very good until the warm-up is complete. The advantage of Redis is that all of its data is loaded into memory at once, so it is warmed up at once, and it can serve requests at full speed as soon as it is up.



There is also a difference in startup time between using an RDB and using an AOF. Starting from an RDB is faster for two reasons. First, each piece of data appears in the RDB file exactly once, whereas the AOF log may contain multiple records for the same data, so with the RDB each piece of data only needs to be loaded once. Second, the RDB file format matches the encoding of Redis data in memory, so no further re-encoding is needed; its CPU cost is much lower than replaying the AOF log.






Well, that is roughly all there is to say here. For a more complete treatment, see the Redis author's blog post "Redis persistence demystified". If anything in this article is described inaccurately, corrections are welcome.


"Redis RDB file Format full parse" "Http://blog.nosqlfan.com/html/3734.html"


The RDB file is one way Redis achieves persistence: following the configured policy, Redis dumps the in-memory data into an RDB file as a mirror image. So what is the internal format of an RDB file, and what does Redis do to make dumping and loading the RDB fast? Let's dive into an RDB file and look at its internal structure.



First, let's look at an overview of an RDB file:


---------------------------- # RDB files are binary, so there is no carriage return or line feed to separate each line.
52 45 44 49 53 # starts with the string "REDIS"
30 30 30 33 # RDB version number, stored as four ASCII characters; "0003" here means version 3
----------------------------
FE 00 # FE marks a database selector; Redis supports multiple databases numbered from 0, and 00 here means database 0
---------------------------- # Key-Value started to store
FD $ length-encoding # FD indicates the expiration time. The expiration time is stored using length encoding, as we will see later.
$ value-type # 1 byte for the type of value, such as set, hash, list, zset, etc.
$ string-encoded-key # Key value, encoded by string encoding, which will also be described later
$ encoded-value # Value value, using different encoding methods according to different Value types
----------------------------
FC $ length-encoding # FC represents the expiration time in milliseconds. The specific time later is stored using length encoding.
$ value-type # Same as above, also a byte value type
$ string-encoded-key # Key value also encoded with string encoding
$ encoded-value # Value is also encoded with the corresponding data type
----------------------------
$ value-type # A Key-Value pair with no expiration time follows. To avoid ambiguity, the value-type byte is never FD, FC, FE, or FF
$ string-encoded-key
$ encoded-value
----------------------------
FE $ length-encoding # The next database starts here; the database number is encoded with length encoding
----------------------------
... # Key-Value pairs of this database continue here
FF # FF marks the end of the RDB file


Below we explain the above content in detail


Magic number


The first line needs little explanation: the string "REDIS" identifies the file as a Redis RDB file.


Version number


The next 4 bytes store the RDB version number, written and read most-significant digit first; the ASCII characters "0003" above mean version 3.


Database number


A database section starts with the byte 0xFE, followed by the number of that database. The database number is encoded with "Length Encoding", which we will explain later.


Key-value pairs


Each key-value pair consists of the following four parts:


    1. The key's expiration time; this item is optional
    2. One byte indicating the type of the value
    3. The key, which is a string, stored with "Redis String Encoding"
    4. The value, stored with "Redis Value Encoding", which differs by data type
Key Expiration Time


The expiration time starts with the byte 0xFD or 0xFC, indicating a second-level or millisecond-level expiration time respectively, followed by a Unix timestamp in seconds or milliseconds. The timestamp value is stored with the "Redis Length Encoding" described below. While importing an RDB file, the expiration time is checked to decide whether the key has already expired and should be ignored.


Value type


The value type is stored in one byte and currently includes the following values:


    • 0 = "String Encoding"
    • 1 = "List Encoding"
    • 2 = "Set Encoding"
    • 3 = "Sorted Set Encoding"
    • 4 = "Hash Encoding"
    • 9 = "Zipmap Encoding"
    • 10 = "Ziplist Encoding"
    • 11 = "Intset Encoding"
    • 12 = "Sorted Set in Ziplist Encoding"
Key


The key is simply encoded with "String Encoding", described below.


Value


There are 9 types of value listed above, which can actually be divided into three main categories


    • Type = 0: the value is a simple string
    • Type = 9, 10, 11, or 12: the value is an encoded string that needs further decoding after it is read
    • Type = 1, 2, 3, or 4: the value is a sequence of strings that is used to build the list, set, hash or zset structure
Length Encoding


Length Encoding has been mentioned many times above, so let's explain it now. You might ask: wouldn't storing the length as an int be fine? However, the lengths we store are usually not large, and an int's 4 bytes would be a bit wasteful, so Redis uses a variable-length encoding that stores numbers of different magnitudes in different numbers of bytes.


    1. When a length is to be read, one byte is fetched first, and its two high-order bits determine the variable-length encoding.
    2. If the two high bits are 0 0, the remaining 6 bits give the length directly.
    3. If the two high bits are 0 1, one more byte is read; together with the remaining 6 bits, 14 bits in total represent the length.
    4. If the two high bits are 1 0, the remaining 6 bits are discarded and the next 4 bytes give the length.
    5. If the two high bits are 1 1, what follows is a special encoding, and the remaining 6 bits identify which special format it is. Special encodings are mainly used to store numbers as strings, or to store compressed strings. See "String Encoding" below for details.


What is the benefit of doing this, actually saving space:


    1. Numbers 0–63 need only one byte
    2. Numbers 64–16383 need only two bytes
    3. Numbers from 16384 up to 2^32-1 need only 5 bytes (a 1-byte marker plus a 4-byte value)
A minimal decoding sketch of these rules follows.
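
A minimal sketch (plain Python, not Redis source) of decoding Length Encoding from a byte stream, following the rules above; stream can be any binary file-like object, and the 4-byte case is assumed here to be big-endian.

import struct

def read_length(stream):
    first = stream.read(1)[0]
    kind = first >> 6                        # the two high-order bits
    if kind == 0:                            # 00: 6-bit length
        return first & 0x3F, False
    if kind == 1:                            # 01: 14-bit length
        second = stream.read(1)[0]
        return ((first & 0x3F) << 8) | second, False
    if kind == 2:                            # 10: length in the next 4 bytes
        return struct.unpack(">I", stream.read(4))[0], False
    return first & 0x3F, True                # 11: special encoding marker
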
String Encoding


Redis String Encoding is binary safe, which means it does not rely on any special delimiter to terminate a value; you can store anything in it. It is simply a sequence of bytes.



Here are the three types of String Encoding


    1. Length-encoded string
    2. A number standing in for a string: an 8-, 16- or 32-bit integer
    3. LZF-compressed string
Length encoded string


The length-encoded string is the simplest kind. It consists of two parts: the length of the string, encoded with "Length Encoding", followed by the raw bytes of the string.


Number substitution string


This is where the special encodings of Length Encoding mentioned above come in. A number standing in for a string starts with the bits 1 1; the remaining 6 bits of that byte are then read, and their value identifies the integer width (a decoding sketch follows the list):


    • 0 means an 8-bit integer follows
    • 1 means a 16-bit integer follows
    • 2 means a 32-bit integer follows
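
A minimal sketch (plain Python, not Redis source) of reading these number-as-string encodings; the integers are assumed here to be little-endian and signed.

import struct

def read_int_string(fmt_code, stream):
    if fmt_code == 0:                        # 8-bit integer
        return struct.unpack("<b", stream.read(1))[0]
    if fmt_code == 1:                        # 16-bit integer
        return struct.unpack("<h", stream.read(2))[0]
    if fmt_code == 2:                        # 32-bit integer
        return struct.unpack("<i", stream.read(4))[0]
    raise ValueError("not an integer encoding: %d" % fmt_code)
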
LZF Compressed string


Like the number-as-string case, it starts with the bits 1 1; the remaining 6 bits are then read, and if their value is 3 (the LZF encoding), what follows is a compressed string. A compressed string is parsed as follows:


    1. First read the compressed length clen according to the Length Encoding rules
    2. Then read the uncompressed length according to the Length Encoding rules
    3. Then read the next clen bytes
    4. Finally, decode those clen bytes with the LZF algorithm
A sketch of these four steps follows.
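
A minimal sketch of the four steps above, reusing the hypothetical read_length helper from the earlier sketch and assuming the third-party python-lzf package for decompression.

import lzf   # pip install python-lzf (assumed available)

def read_lzf_string(stream):
    clen, _ = read_length(stream)     # 1. compressed length
    ulen, _ = read_length(stream)     # 2. uncompressed length
    compressed = stream.read(clen)    # 3. the next clen bytes
    return lzf.decompress(compressed, ulen)   # 4. LZF-decode to ulen bytes
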
List Encoding


A Redis list structure is stored in an RDB file by writing out each element of the list in turn. The structure is as follows:


    1. First read the list size with Length Encoding
    2. Then read size values, each with String Encoding
    3. Then rebuild the list from the size values that were read
Set Encoding


The set structure, like the list structure, also stores the elements in sequence.


Sorted Set Encoding


Todo


Hash Encoding
    1. First read the size of the hash structure with Length Encoding
    2. Then read 2 × size strings with String Encoding (each hash entry consists of a key and a value)
    3. Pair the 2 × size strings read above into keys and values
    4. Store those key-value pairs into the hash structure
A minimal sketch of this follows.
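
A minimal sketch of rebuilding a hash from the 2 × size strings, reusing the hypothetical read_length and read_string helpers from the sketches above.

def read_hash(stream):
    size, _ = read_length(stream)
    flat = [read_string(stream) for _ in range(2 * size)]   # k1, v1, k2, v2, ...
    return dict(zip(flat[0::2], flat[1::2]))                # pair them up
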
Zipmap Encoding


See the previous article: Redis Zipmap Memory Layout analysis


To be continued & welcome contributions


Ziplist Encoding



Intset Encoding



Sorted Set as Ziplist Encoding



Source: Rdb_file_format.textile


