Redis AOF
Earlier articles introduced the main framework of Redis and the two general principles of persistence. This article analyzes the implementation of Redis AOF from the source code perspective. (It is based on version 2.4.2.)
1. Related configuration items
First, let's look at the AOF configuration options in redis.conf:
appendonly (yes, no): whether AOF persistence is enabled.
appendfilename (log/appendonly.aof): the AOF log file.
appendfsync (always, everysec, no): how often the AOF log file is synced to disk. always calls fsync after every write; everysec calls fsync once per second; no means Redis never calls fsync itself and leaves flushing to the OS.
no-appendfsync-on-rewrite (yes, no): whether to skip fsync while a background save or rewrite child is running (yes skips it).
auto-aof-rewrite-percentage (100): when the AOF file has grown by this percentage relative to its base size (100 means it has doubled), a background rewrite starts automatically.
auto-aof-rewrite-min-size (64 MB): the minimum AOF file size required before an automatic rewrite. These two options jointly determine whether it is time to run a background rewrite.
Note: when the AOF file has grown large, rewrite regenerates it from the data currently in memory. Each in-memory value is converted back into the command that would create it, the commands are written to a new AOF file, and the new file replaces the original one.
From the options above we can identify three AOF flows in Redis:
The AOF write performed for each update operation (this is where the sync frequency matters).
Automatic rewrite: when the auto-aof-rewrite-percentage and auto-aof-rewrite-min-size conditions are met, a background rewrite is started automatically.
Requested rewrite: when the server receives the BGREWRITEAOF command from a client, it starts a background rewrite immediately.
Note: the AOF is also written when a key expires. This is similar to the first flow and is not covered here. The three flows are described below.
Two background threads were added in later versions of Redis (I am not sure starting from which version):
REDIS_BIO_CLOSE_FILE, responsible for all file close operations
REDIS_BIO_AOF_FSYNC, responsible for fsync operations
Both operations may block. If they ran in the main thread, the server's response to events would suffer, so dedicated threads handle them. Each thread has its own bio_jobs list holding the jobs it still has to process. The code is in bio.c (the thread function is bioProcessBackgroundJobs). The two threads are created by bioInit(), which is called from initServer().
Note: the standard command format. For example, set AAA Xiang is encoded as:
*3\r\n$3\r\nSET\r\n$3\r\nAAA\r\n$5\r\nXiang\r\n
*3 is the number of arguments of the command (including the command name), and each $N gives the length in bytes of the argument that follows it.
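The encoding rule above can be sketched in a few lines of C. This is only an illustration, not the Redis implementation (Redis builds the string with sds in catAppendOnlyGenericCommand); the function name encode_command and the fixed output buffer are assumptions made for the example, and arguments are treated as plain C strings, so the sketch is not binary safe.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Encode argc arguments as "*<argc>\r\n" followed by
 * "$<len>\r\n<arg>\r\n" for each argument.
 * Returns the number of bytes written (excluding the trailing NUL),
 * or -1 if the output buffer is too small. */
int encode_command(char *out, size_t outsz, int argc, const char **argv) {
    assert(out != NULL && argv != NULL && argc >= 0);
    size_t used;
    int n = snprintf(out, outsz, "*%d\r\n", argc);
    if (n < 0 || (size_t)n >= outsz) return -1;
    used = (size_t)n;
    for (int i = 0; i < argc; i++) {
        n = snprintf(out + used, outsz - used, "$%zu\r\n%s\r\n",
                     strlen(argv[i]), argv[i]);
        if (n < 0 || (size_t)n >= outsz - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Calling it with the three arguments SET, AAA and Xiang produces exactly the byte string shown in the note.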
2. AOF processing flows
2.1 AOF write for each update operation
The configuration options involved are appendfsync and no-appendfsync-on-rewrite. The entry point is in redis.c:
void call(redisClient *c) {
    dirty = server.dirty;              // dirty count before the command runs
    c->cmd->proc(c);                   // execute the command; an update increases server.dirty
    dirty = server.dirty-dirty;        // number of changes caused by this command
    ...
    if (server.appendonly && dirty > 0)    // data changed and AOF is enabled
        feedAppendOnlyFile(c->cmd, c->db->id, c->argv, c->argc);   // append the command to server.aofbuf
    ...
}
Let's take a look at the implementation of feedAppendOnlyFile:
void feedAppendOnlyFile(struct redisCommand ... {
    if (dictid != server.appendseldb) {
        // The DB targeted by the current command differs from the previous one,
        // so a new SELECT command must be written first. (appendseldb is also
        // reset to -1 when a rewrite starts, which forces this branch.)
        buf = sdscatprintf(buf, "*2\r\n$6\r\nSELECT\r\n$%lu\r\n%s\r\n",
            (unsigned long) strlen(seldb), seldb);
        server.appendseldb = dictid;
    }
    ...
    buf = catAppendOnlyGenericCommand(buf, argc, argv);   // convert to the standard command format
    // Append the command to aofbuf; this buffer is later written and fsynced to the file.
    server.aofbuf = sdscatlen(server.aofbuf, buf, sdslen(buf));
    // If a bgrewrite child exists, also save the command in bgrewritebuf, so that
    // when the child finishes the new changes can be appended to the rewritten file.
    if (server.bgrewritechildpid != -1)
        server.bgrewritebuf = sdscatlen(server.bgrewritebuf, buf, sdslen(buf));
    ...
}
As you can see, the AOF operation above only appends to a buffer; nothing is written to the file yet. Following the code further, flushAppendOnlyFile() is the function that performs the real write, and it is called from beforeSleep and from serverCron. beforeSleep is invoked by the aeMain loop once per iteration, before each round of event processing:
void aeMain(aeEventLoop *eventLoop) {
    eventLoop->stop = 0;
    while (!eventLoop->stop) {
        if (eventLoop->beforesleep != NULL)
            eventLoop->beforesleep(eventLoop);
        aeProcessEvents(eventLoop, AE_ALL_EVENTS);
    }
}
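The pattern described above, where commands are only appended to an in-memory buffer while events are processed and the buffer is written out once per loop iteration, can be sketched as follows. The names aof_state, feed_buffer and before_sleep_flush are hypothetical, and a fixed-size array stands in for the sds buffer that Redis actually uses; short writes and errors are deliberately left unhandled.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

#define AOF_BUF_MAX 4096

/* Hypothetical stand-in for server.aofbuf and server.appendfd. */
struct aof_state {
    char buf[AOF_BUF_MAX];
    size_t len;
    int fd;                 /* descriptor of the append-only file */
};

/* Called while a command executes: only memory is touched. */
int feed_buffer(struct aof_state *s, const char *cmd, size_t cmdlen) {
    assert(s != NULL && cmd != NULL);
    if (cmdlen > AOF_BUF_MAX - s->len) return -1;   /* sketch: buffer does not grow */
    memcpy(s->buf + s->len, cmd, cmdlen);
    s->len += cmdlen;
    return 0;
}

/* Called once per loop iteration, before waiting for new events.
 * Returns the result of write(), or 0 if there was nothing to flush. */
ssize_t before_sleep_flush(struct aof_state *s) {
    assert(s != NULL);
    if (s->len == 0) return 0;
    ssize_t n = write(s->fd, s->buf, s->len);
    if (n == (ssize_t)s->len) s->len = 0;   /* sketch: a short write is left unhandled */
    return n;
}
```

Because the flush happens before the loop goes back to processing events, everything buffered during one iteration reaches the file descriptor before any further command is handled.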
serverCron first checks whether a flush has been postponed:
if (server.aof_flush_postponed_start) flushAppendOnlyFile(0);
Next, let's look at flushAppendOnlyFile itself:
void flushAppendOnlyFile(int force) {
    ...
    if (server.appendfsync == APPENDFSYNC_EVERYSEC)    // fsync policy is everysec
        // is an fsync job still waiting for the fsync thread?
        sync_in_progress = bioPendingJobsOfType(REDIS_BIO_AOF_FSYNC) != 0;
    if (server.appendfsync == APPENDFSYNC_EVERYSEC && !force) {
        if (sync_in_progress) {
            // An fsync job is already pending, so this call neither writes nor
            // queues another job. If no flush was postponed before, record
            // server.unixtime as the start of the postponement and return.
            // If a flush is already postponed and has waited less than 2 s,
            // return and keep waiting. Otherwise fall through and write.
            if (server.aof_flush_postponed_start == 0) {
                server.aof_flush_postponed_start = server.unixtime;
                return;
            } else if (server.unixtime - server.aof_flush_postponed_start < 2) {
                return;
            }
            redisLog(REDIS_NOTICE, "Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.");
        }
    }
    server.aof_flush_postponed_start = 0;
    // write() only hands the buffer to the OS; the data is not guaranteed to be
    // on disk until fsync is called.
    nwritten = write(server.appendfd, server.aofbuf, sdslen(server.aofbuf));
    // no-appendfsync-on-rewrite: skip fsync while a child is doing background I/O
    if (server.no_appendfsync_on_rewrite &&
        (server.bgrewritechildpid != -1 || server.bgsavechildpid != -1))
        return;
    if (server.appendfsync == APPENDFSYNC_ALWAYS) {
        // always: fsync immediately; this blocks the main thread
        aof_fsync(server.appendfd);    /* Let's try to get this data on the disk */
        server.lastfsync = server.unixtime;
    } else if (server.appendfsync == APPENDFSYNC_EVERYSEC &&
               server.unixtime > server.lastfsync) {
        // everysec: add a job to the queue of the fsync thread
        if (!sync_in_progress) aof_background_fsync(server.appendfd);
        server.lastfsync = server.unixtime;
    }
}
From the above we can see that even when appendfsync is set to always, the AOF file is not written (write + fsync) immediately after each update command. The write + fsync is deferred until beforeSleep runs, after the current round of event processing has finished. One question: the command is first written to server.aofbuf and only later to the file, so can a crash lose data? The answer is no. beforeSleep flushes after one round of event processing completes and before the next one begins, so the reply (success or failure) is sent to the client only after the flush. (This explanation comes from @hoterran. Is there a possibility of repeated writes?)
If an fsync job is already waiting for the fsync thread when beforeSleep runs (there is only one AOF fd, and I still wonder why another job cannot simply be put on the list), then with appendfsync set to everysec and force == 0 the sync_in_progress branch is taken: the flush is marked as postponed in server.aof_flush_postponed_start and the function returns. When serverCron runs, it calls flushAppendOnlyFile again to check whether the write can now be performed and the job submitted to the fsync thread. If the flush has already waited 2 s or more, a notice is logged and the write proceeds anyway. [So everysec does not really mean one fsync every 1 s.]
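The postponement logic described above can be isolated into a small decision function. This is a sketch under stated assumptions, not Redis code: the name everysec_flush_decision and the flush_action enum are invented for the example, and the postponed_start parameter plays the role of server.aof_flush_postponed_start.

```c
#include <assert.h>
#include <time.h>

enum flush_action { FLUSH_POSTPONE, FLUSH_WRITE };

/* Decide whether a flush under the everysec policy should write now or
 * be postponed. *postponed_start is 0 when no flush is being postponed. */
enum flush_action everysec_flush_decision(int sync_in_progress, int force,
                                          time_t now, time_t *postponed_start) {
    assert(postponed_start != NULL);
    if (!force && sync_in_progress) {
        if (*postponed_start == 0) {
            *postponed_start = now;     /* first postponement: remember when it began */
            return FLUSH_POSTPONE;
        }
        if (now - *postponed_start < 2)
            return FLUSH_POSTPONE;      /* still inside the 2 s grace period */
        /* waited 2 s or more: fall through and write anyway */
    }
    *postponed_start = 0;               /* writing now, so clear the marker */
    return FLUSH_WRITE;
}
```

With a pending fsync, calls at t, t+1 and t+2 return postpone, postpone and write, which is why the effective interval between fsyncs under everysec can exceed one second.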
2.2 Automatic background rewrite
Configuration options involved: auto-aof-rewrite-percentage and auto-aof-rewrite-min-size.
serverCron decides whether it is time to run a background rewrite:
serverCron() {
    if (server.bgsavechildpid != -1 || server.bgrewritechildpid != -1) {
    } else {
        ...    // check whether rdbSaveBackground is needed and, if so, save the RDB
        if (server.bgsavechildpid == -1 &&
            server.bgrewritechildpid == -1 &&
            server.auto_aofrewrite_perc &&
            server.appendonly_current_size > server.auto_aofrewrite_min_size)
        {
            // No background child is running and the file is larger than
            // auto_aofrewrite_min_size.
            long long base = server.auto_aofrewrite_base_size ?
                             server.auto_aofrewrite_base_size : 1;
            long long growth = (server.appendonly_current_size*100/base) - 100;
            if (growth >= server.auto_aofrewrite_perc) {    // check the growth percentage
                redisLog(REDIS_NOTICE, "Starting automatic rewriting of AOF on %lld%% growth", growth);
                rewriteAppendOnlyFileBackground();
            }
        }
    }
}
The rewriteAppendOnlyFileBackground() function is also used in the next flow, so the two are analyzed together below.
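To make the trigger condition concrete, the check can be written as a standalone function. This is a minimal sketch that mirrors the arithmetic of the excerpt; the function name should_auto_rewrite and its parameters are hypothetical and simply stand in for the corresponding server fields.

```c
#include <assert.h>

/* Returns 1 when an automatic rewrite should start, 0 otherwise.
 * current_size: current AOF size in bytes
 * base_size:    AOF size recorded after the last rewrite (0 if unknown)
 * min_size:     auto-aof-rewrite-min-size in bytes
 * percentage:   auto-aof-rewrite-percentage (0 disables the check) */
int should_auto_rewrite(long long current_size, long long base_size,
                        long long min_size, int percentage) {
    assert(current_size >= 0 && base_size >= 0);
    if (percentage == 0) return 0;                  /* feature turned off */
    if (current_size <= min_size) return 0;         /* file still too small */
    long long base = base_size ? base_size : 1;     /* avoid division by zero */
    long long growth = (current_size * 100 / base) - 100;
    return growth >= percentage;
}
```

For example, with a base size of 100 MB, a minimum of 64 MB and a percentage of 100, the function returns 1 once the file reaches 200 MB, which is the "doubled" case mentioned for the default setting.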
2.3 The client sends the BGREWRITEAOF command
Looking up readonlyCommandTable shows that when a client sends BGREWRITEAOF, the server handles it with bgrewriteaofCommand. This function checks whether a bgrewrite child or a bgsave child already exists. If one does, it sets server.aofrewrite_scheduled = 1, meaning that a rewrite is required but cannot start now and will be started later from serverCron. Otherwise it calls rewriteAppendOnlyFileBackground to create the bgrewrite process and perform the rewrite.
int rewriteAppendOnlyFileBackground() {
    if ((childpid = fork()) == 0) {                // child: the background process
        if (server.ipfd > 0) close(server.ipfd);   // close the listening sockets
        if (server.sofd > 0) close(server.sofd);
        // Name of the new temporary AOF file. rewriteAppendOnlyFile itself
        // uses yet another temp file name internally.
        snprintf(tmpfile, 256, "temp-rewriteaof-bg-%d.aof", (int) getpid());
        if (rewriteAppendOnlyFile(tmpfile) == REDIS_OK)    // rewrite into the new temp file
        ...
    } else {                                       // parent
        server.aofrewrite_scheduled = 0;           // the scheduled rewrite has now been started
        server.bgrewritechildpid = childpid;       // records that a rewrite child exists
        updateDictResizePolicy();                  // disable dict resizing while the child runs
        server.appendseldb = -1;                   // force the next update to write a SELECT command
        return REDIS_OK;
    }
}
Next, let's look at how the child process does its job:
int rewriteAppendOnlyFile(char *filename) {
    snprintf(tmpfile, 256, "temp-rewriteaof-%d.aof", (int) getpid());
    fp = fopen(tmpfile, "w");                      // open a new temp file
    for (j = 0; j < server.dbnum; j++) {           // traverse all databases
        char selectcmd[] = "*2\r\n$6\r\nSELECT\r\n";   // a SELECT command is written for each database
        redisDb *db = server.db+j;
        if (fwrite(selectcmd, sizeof(selectcmd)-1, 1, fp) == 0) goto werr;
        if (fwriteBulkLongLong(fp, j) == 0) goto werr;  // db id
        while ((de = dictNext(di)) != NULL) {      // each dictEntry in the db
            keystr = dictGetEntryKey(de);          // the key
            o = dictGetEntryVal(de);               // the value
            initStaticStringObject(key, keystr);   // wrap keystr in a robj
            if (o->type == REDIS_STRING) {
                // The value type is checked case by case to choose the matching
                // command and encoding; REDIS_STRING is shown as an example.
                char cmd[] = "*3\r\n$3\r\nSET\r\n";     // the command comes first
                if (fwrite(cmd, sizeof(cmd)-1, 1, fp) == 0) goto werr;
                if (fwriteBulkObject(fp, &key) == 0) goto werr;   // write the key
                if (fwriteBulkObject(fp, o) == 0) goto werr;      // write the value
            } else if (...)
            else if ...
        }
    }
    fflush(fp);
    aof_fsync(fileno(fp));                         // fsync the file, then close it
    fclose(fp);
    // Rename to the name passed in (temp-rewriteaof-bg-%d.aof). I don't
    // understand why a separate temp file, temp-rewriteaof-%d.aof, is used here.
    rename(tmpfile, filename);
}
The child process has now completed the rewrite. When does the parent process collect the child's exit status and finish the work on the main thread?
if (server.bgsavechildpid != -1 || server.bgrewritechildpid != -1) {
    if ((pid = wait3(&statloc, WNOHANG, NULL)) != 0) {
        if (pid == server.bgsavechildpid) {
            backgroundSaveDoneHandler(statloc);       // the background RDB save child exited
        } else {
            backgroundRewriteDoneHandler(statloc);    // the background rewrite child exited
        }
        updateDictResizePolicy();
    }
}
That is, in serverCron the parent uses server.bgrewritechildpid to decide whether there is a child to wait for, and polls for its exit without blocking.
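The non-blocking poll above can be tried in isolation. The sketch below is an assumption-laden illustration rather than Redis code: the helper name poll_child is invented, and it uses waitpid(-1, ..., WNOHANG), which behaves like the wait3 call in the excerpt when no resource usage is requested.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Poll once for a finished child without blocking.
 * Returns 1 and stores the exit code if child_pid exited normally,
 * 0 if no child has changed state yet,
 * -1 on error, on an unexpected child, or on abnormal termination. */
int poll_child(pid_t child_pid, int *exitcode) {
    int statloc;
    assert(exitcode != NULL);
    pid_t pid = waitpid(-1, &statloc, WNOHANG);
    if (pid == 0) return 0;                  /* nothing has finished yet */
    if (pid != child_pid) return -1;         /* error or some other child */
    if (!WIFEXITED(statloc)) return -1;      /* e.g. killed by a signal */
    *exitcode = WEXITSTATUS(statloc);
    return 1;
}
```

A periodic timer can call such a function on every tick, so the main thread never sleeps in wait while the child is still writing the temporary file.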
Next, let's look at what backgroundRewriteDoneHandler does. (Some tricks are used here to work around weaknesses of AOF; they are worth studying.)
void backgroundRewriteDoneHandler(int statloc) {
    if (!bysignal && exitcode == 0) {              // check the exit status
        snprintf(tmpfile, 256, "temp-rewriteaof-bg-%d.aof", (int) server.bgrewritechildpid);
        newfd = open(tmpfile, O_WRONLY|O_APPEND);  // open the temp file written by the child
        ...
        // append the changes accumulated in bgrewritebuf to the temp file
        nwritten = write(newfd, server.bgrewritebuf, sdslen(server.bgrewritebuf));
        /* If rename() replaces an existing file that nobody has open (no
         * process references it), the old file is unlinked as part of the
         * rename, and that unlink can block the main thread. The solution
         * is to hold a reference to the old file before renaming. When AOF
         * is disabled, the old file is opened with O_NONBLOCK to add a
         * reference; it does not matter whether the open succeeds, because
         * if the file does not exist there is nothing to unlink. When AOF
         * is enabled, the file is already open as server.appendfd, so
         * oldfd is set to -1 for now and rename will not unlink anything.
         * The old descriptor is then closed by a background thread, because
         * that close is what triggers the blocking unlink. */
        if (server.appendfd == -1)                 // AOF is disabled (a client can disable it with a command)
            oldfd = open(server.appendfilename, O_RDONLY|O_NONBLOCK);   // add a reference to the old file
        else
            oldfd = -1;    // -1 so that nothing is closed if rename fails; set to the old AOF fd below
        rename(tmpfile, server.appendfilename);    // this rename does not cause an unlink
        if (server.appendfd == -1) {
            close(newfd);                          // AOF is disabled, so close the new AOF file
        } else {
            oldfd = server.appendfd;               // remember the old fd
            server.appendfd = newfd;               // newfd becomes the AOF fd
            if (server.appendfsync == APPENDFSYNC_ALWAYS)
                aof_fsync(newfd);                  // fsync directly; this blocks
            else if (server.appendfsync == APPENDFSYNC_EVERYSEC)
                aof_background_fsync(newfd);       // queue the fsync for the fsync thread
            server.appendseldb = -1;               /* Make sure SELECT is re-issued */
            aofUpdateCurrentSize();
            server.auto_aofrewrite_base_size = server.appendonly_current_size;
            // Clear aofbuf: its contents are also in bgrewritebuf and have just
            // been written to the current AOF file.
            sdsfree(server.aofbuf);
            server.aofbuf = sdsempty();
        }
        if (oldfd != -1)
            bioCreateBackgroundJob(REDIS_BIO_CLOSE_FILE, (void*)(long)oldfd, NULL, NULL);   // close in the background thread
        ...
    }
}
The following article also explains how the new version solves some problems of the old AOF implementation: http://www.hoterran.info/redis-aof-backgroud-thread
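The rename trick used by the handler can be demonstrated with plain POSIX calls. This is a sketch under assumptions, not the Redis code: the function name replace_keep_old is invented, and closing the returned descriptor on another thread is left to the caller.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replace `target` with `tmp_path` while holding a descriptor on the old
 * target, so that the old file's data is not released during rename().
 * On success returns 0 and stores the old file's descriptor in *oldfd
 * (-1 if the target did not exist). The caller closes *oldfd later,
 * ideally off the main thread, since that close frees the old file.
 * Returns -1 if rename() fails. */
int replace_keep_old(const char *tmp_path, const char *target, int *oldfd) {
    assert(tmp_path != NULL && target != NULL && oldfd != NULL);
    *oldfd = open(target, O_RDONLY | O_NONBLOCK);   /* a failure here is fine */
    if (rename(tmp_path, target) == -1) {
        if (*oldfd != -1) close(*oldfd);
        *oldfd = -1;
        return -1;
    }
    return 0;
}
```

After the call, the target name refers to the new file, while the old contents remain readable through the returned descriptor until it is closed.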
3. Summary
This article has covered most of what AOF does. In essence there are two kinds of work. The first is the per-command write: when the server receives an update operation, it appends the command to aofbuf, and after one event loop iteration completes (in beforeSleep) it writes the buffer and performs fsync. Depending on the configured sync frequency, fsync is done directly by the main thread (always) or handed to the fsync thread (everysec). The second is the rewrite operation, implemented by a background child process. Through copy-on-write the child sees the same address space contents as the parent at the time of the fork; it converts the contents of the dict of every database back into commands and writes them to a temporary file. Meanwhile the parent must cache new update operations in bgrewritebuf. When the child finishes (the earlier data has already been written to the temporary file), the parent, in serverCron, appends the content of bgrewritebuf to the child's temporary file and renames the temporary file to the file name specified in the configuration. This completes one rewrite and swap. The author of Redis has also gone to great lengths to keep the main thread fast. That is welcome, because these changes teach us a lot of techniques.
References:
http://www.hoterran.info/redis_persistence
http://www.hoterran.info/redis-aof-backgroud-thread