Redis Source code parsing (15)---aof-append only file parsing

Continuing with the Redis source files related to data persistence, today's file is aof.c. AOF is short for "append only file", i.e. a file that is only ever appended to. Its job is to record every operation that changes the data, so that the dataset can be recovered from this file after an abnormal exit; the role is similar to the redo/undo logs discussed earlier. As everyone knows, Redis is an in-memory database: every data change is applied in memory first and only flushed to a file on disk later to achieve persistence, and the AOF machinery works in the same way. This is where the concept of a "block" comes in, which is really just a buffer block. Some of the block definitions are as follows:




/* The AOF rewrite code below uses simple buffer blocks to accumulate the
 * records of data changes. New data is appended to the current block, and
 * since a single block cannot grow without bound, redis keeps a list of
 * fixed-size blocks and allocates a new one whenever the current block is
 * full. Each block is 10 MB. */
#define AOF_RW_BUF_BLOCK_SIZE (1024*1024*10)    /* 10 MB per block */

/* Standard AOF read/write buffer block */
typedef struct aofrwblock {
    /* How many bytes of this block are used, and how many are still free */
    unsigned long used, free;
    /* The actual storage, AOF_RW_BUF_BLOCK_SIZE (10 MB) bytes */
    char buf[AOF_RW_BUF_BLOCK_SIZE];
} aofrwblock;

In other words, each block is 10 MB by default, a size that is neither particularly large nor small. If the incoming data exceeds the space left in the current block, the system dynamically allocates a new buffer block, and on the server side the blocks are organized into a linked list. A sketch of how that list might be set up is shown below, followed by the append routine itself:
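As a small, hedged illustration of how that block list might be set up, the reset helper listed later in the API section (aofRewriteBufferReset) essentially just replaces the old list with a fresh empty one; the sketch below is based on the generic adlist API and is not quoted verbatim from the source:

/* Sketch: drop the old rewrite buffer list and start with a fresh, empty
 * one. Blocks are released with zfree() when the list itself is freed. */
void aofRewriteBufferReset(void) {
    if (server.aof_rewrite_buf_blocks)
        listRelease(server.aof_rewrite_buf_blocks);

    server.aof_rewrite_buf_blocks = listCreate();
    listSetFreeMethod(server.aof_rewrite_buf_blocks, zfree);
}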






/* Append data to the AOF rewrite buffer, allocating new blocks if needed. */
void aofRewriteBufferAppend(unsigned char *s, unsigned long len) {
    /* Locate the last block of the buffer list; appends always go there. */
    listNode *ln = listLast(server.aof_rewrite_buf_blocks);
    aofrwblock *block = ln ? ln->value : NULL;

    while (len) {
        /* If we already got at least an allocated block, try appending
         * at least some piece into it. */
        if (block) {
            /* Copy as much as fits into the free space of the current block. */
            unsigned long thislen = (block->free < len) ? block->free : len;
            if (thislen) {  /* The current block is not already full. */
                memcpy(block->buf+block->used, s, thislen);
                block->used += thislen;
                block->free -= thislen;
                s += thislen;
                len -= thislen;
            }
        }

        if (len) { /* First block to allocate, or need another block. */
            int numblocks;

            /* The current block is full (or there is none yet): allocate a
             * new block and append it to the server's block list. */
            block = zmalloc(sizeof(*block));
            block->free = AOF_RW_BUF_BLOCK_SIZE;
            block->used = 0;
            listAddNodeTail(server.aof_rewrite_buf_blocks, block);

            /* Log every time we cross more 10 or 100 blocks, respectively
             * as a notice or warning. */
            numblocks = listLength(server.aof_rewrite_buf_blocks);
            if (((numblocks+1) % 10) == 0) {
                int level = ((numblocks+1) % 100) == 0 ? REDIS_WARNING :
                                                         REDIS_NOTICE;
                redisLog(level, "Background AOF buffer size: %lu MB",
                    aofRewriteBufferSize()/(1024*1024));
            }
        }
    }
}
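The logging branch above calls aofRewriteBufferSize() to report how much data is pending. A minimal sketch of one way to compute that value, by walking the block list with the adlist iterator and summing the used bytes of each block (not necessarily the exact code in aof.c):

/* Sketch: total number of bytes currently held in the rewrite buffer. */
unsigned long aofRewriteBufferSize(void) {
    unsigned long size = 0;
    listIter li;
    listNode *ln;

    listRewind(server.aof_rewrite_buf_blocks, &li);
    while ((ln = listNext(&li)) != NULL) {
        aofrwblock *block = listNodeValue(ln);
        size += block->used;
    }
    return size;
}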
The following method is called when the buffered data should be proactively flushed from the buffer to the persistent file on disk:






/* Write the append only file buffer on disk.
 *
 * Since we are required to write the AOF before replying to the client,
 * and the only way the client socket can get a write is when entering
 * the event loop, we accumulate all the AOF writes in a memory
 * buffer and write it on disk using this function just before entering
 * the event loop again.
 *
 * About the 'force' argument:
 *
 * When the fsync policy is set to 'everysec' we may delay the flush if there
 * is still an fsync() going on in the background thread, since for instance
 * on Linux write(2) will be blocked by the background fsync anyway.
 * When this happens we remember that there is some aof buffer to be
 * flushed ASAP, and will try to do that in the serverCron() function.
 *
 * However if force is set to 1 we'll write regardless of the background
 * fsync. */
#define AOF_WRITE_LOG_ERROR_RATE 30 /* Seconds between errors logging. */
/* Flush the contents of the AOF buffer to disk. */
void flushAppendOnlyFile(int force) {
    ssize_t nwritten;
    int sync_in_progress = 0;
    mstime_t latency;

    if (sdslen(server.aof_buf) == 0) return;

    if (server.aof_fsync == AOF_FSYNC_EVERYSEC)
        sync_in_progress = bioPendingJobsOfType(REDIS_BIO_AOF_FSYNC) != 0;

    if (server.aof_fsync == AOF_FSYNC_EVERYSEC && !force) {
        /* With this append fsync policy we do background fsyncing.
         * If the fsync is still in progress we can try to delay
         * the write for a couple of seconds. */
        if (sync_in_progress) {
            if (server.aof_flush_postponed_start == 0) {
                /* No previous write postponing, remember that we are
                 * postponing the flush and return. */
                server.aof_flush_postponed_start = server.unixtime;
                return;
            } else if (server.unixtime - server.aof_flush_postponed_start < 2) {
                /* We were already waiting for fsync to finish, but for less
                 * than two seconds this is still ok. Postpone again. */
                return;
            }
            /* Otherwise fall through, and go write since we can't wait
             * over two seconds. */
            server.aof_delayed_fsync++;
            redisLog(REDIS_NOTICE,"Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.");
        }
    }
    /* We want to perform a single write. This should be guaranteed atomic
     * at least if the filesystem we are writing is a real physical one.
     * While this will save us against the server being killed I don't think
     * there is much to do about the whole server stopping for power problems
     * or alike. */

    /* The write itself is wrapped with latency monitoring. */
    latencyStartMonitor(latency);
    nwritten = write(server.aof_fd,server.aof_buf,sdslen(server.aof_buf));
    latencyEndMonitor(latency);
    /* We want to capture different events for delayed writes:
     * when the delay happens with a pending fsync, or with a saving child
     * active, and when the above two conditions are missing.
     * We also use an additional event name to save all samples which is
     * useful for graphing / monitoring purposes. */
    if (sync_in_progress) {
        latencyAddSampleIfNeeded("aof-write-pending-fsync",latency);
    } else if (server.aof_child_pid != -1 || server.rdb_child_pid != -1) {
        latencyAddSampleIfNeeded("aof-write-active-child",latency);
    } else {
        latencyAddSampleIfNeeded("aof-write-alone",latency);
    }
    latencyAddSampleIfNeeded("aof-write",latency);

    /* We performed the write so reset the postponed flush sentinel to zero. */
    server.aof_flush_postponed_start = 0;

    if (nwritten != (signed)sdslen(server.aof_buf)) {
        static time_t last_write_error_log = 0;
        int can_log = 0;

        /* Limit logging rate to 1 line per AOF_WRITE_LOG_ERROR_RATE seconds. */
        if ((server.unixtime - last_write_error_log) > AOF_WRITE_LOG_ERROR_RATE) {
            can_log = 1;
            last_write_error_log = server.unixtime;
        }

        /* Log the AOF write error and record the error code. */
        if (nwritten == -1) {
            if (can_log) {
                redisLog(REDIS_WARNING,"Error writing to the AOF file: %s",
                    strerror(errno));
                server.aof_last_write_errno = errno;
            }
        } else {
            if (can_log) {
                redisLog(REDIS_WARNING,"Short write while writing to "
                                       "the AOF file: (nwritten=%lld, "
                                       "expected=%lld)",
                                       (long long)nwritten,
                                       (long long)sdslen(server.aof_buf));
            }

            if (ftruncate(server.aof_fd, server.aof_current_size) == -1) {
                if (can_log) {
                    redisLog(REDIS_WARNING, "Could not remove short write "
                             "from the append only file. Redis may refuse "
                             "to load the AOF the next time it starts. "
                             "ftruncate: %s", strerror(errno));
                }
            } else {
                /* If the ftruncate() succeeded we can set nwritten to
                 * -1 since there is no longer partial data into the AOF. */
                nwritten = -1;
            }
            server.aof_last_write_errno = ENOSPC;
        }

        /* Handle the AOF write error. */
        if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
            /* We can't recover when the fsync policy is ALWAYS since the
             * reply for the client is already in the output buffers, and we
             * have the contract with the user that on acknowledged write data
             * is synched on disk. */
            redisLog(REDIS_WARNING,"Can't recover from AOF write error when the AOF fsync policy is 'always'. Exiting...");
            exit(1);
        } else {
            /* Recover from failed write leaving data into the buffer. However
             * set an error to stop accepting writes as long as the error
             * condition is not cleared. */
            server.aof_last_write_status = REDIS_ERR;

            /* Trim the sds buffer if there was a partial write, and there
             * was no way to undo it with ftruncate(2). */
            if (nwritten > 0) {
                server.aof_current_size += nwritten;
                sdsrange(server.aof_buf,nwritten,-1);
            }
            return; /* We'll try again on the next call... */
        }
    } else {
        /* Successful write(2). If AOF was in error state, restore the
         * OK state and log the event. */
        if (server.aof_last_write_status == REDIS_ERR) {
            redisLog(REDIS_WARNING,
                "AOF write error looks solved, Redis can write again.");
            server.aof_last_write_status = REDIS_OK;
        }
    }
    server.aof_current_size += nwritten;

    /* Re-use AOF buffer when it is small enough. The maximum comes from the
     * arena size of 4k minus some overhead (but is otherwise arbitrary). */
    if ((sdslen(server.aof_buf)+sdsavail(server.aof_buf)) < 4000) {
        sdsclear(server.aof_buf);
    } else {
        sdsfree(server.aof_buf);
        server.aof_buf = sdsempty();
    }

    /* Don't fsync if no-appendfsync-on-rewrite is set to yes and there are
     * children doing I/O in the background. */
    if (server.aof_no_fsync_on_rewrite &&
        (server.aof_child_pid != -1 || server.rdb_child_pid != -1))
            return;

    /* Perform the fsync if needed. */
    if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
        /* aof_fsync is defined as fdatasync() for Linux in order to avoid
         * flushing metadata. */
        latencyStartMonitor(latency);
        aof_fsync(server.aof_fd); /* Let's try to get this data on the disk */
        latencyEndMonitor(latency);
        latencyAddSampleIfNeeded("aof-fsync-always",latency);
        server.aof_last_fsync = server.unixtime;
    } else if ((server.aof_fsync == AOF_FSYNC_EVERYSEC &&
                server.unixtime > server.aof_last_fsync)) {
        if (!sync_in_progress) aof_background_fsync(server.aof_fd);
        server.aof_last_fsync = server.unixtime;
    }
}
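In the 'everysec' branch above the actual fsync is handed to a background I/O thread via aof_background_fsync(). As a hedged sketch based on the bio.c job API (not quoted verbatim), that helper essentially just queues an AOF-fsync job for the file descriptor:

/* Sketch: let the background I/O (bio) thread fsync the AOF file descriptor,
 * so a slow disk does not block the main event loop. */
void aof_background_fsync(int fd) {
    bioCreateBackgroundJob(REDIS_BIO_AOF_FSYNC, (void*)(long)fd, NULL, NULL);
}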


Of course, there is also an operation that walks every key in every database and records a command for it, so that the whole dataset can be recovered from this single file:






/* Write a sequence of commands able to fully rebuild the dataset into
 * "filename". Used both by REWRITEAOF and BGREWRITEAOF.
 *
 * In order to minimize the number of commands needed in the rewritten
 * log Redis uses variadic commands when possible, such as RPUSH, SADD
 * and ZADD. However at max REDIS_AOF_REWRITE_ITEMS_PER_CMD items per time
 * are inserted using a single command. */
/* Completely rewrite the contents of the database into the file, key by key. */
int rewriteAppendOnlyFile(char *filename) {
    dictIterator *di = NULL;
    dictEntry *de;
    rio aof;
    FILE *fp;
    char tmpfile[256];
    int j;
    long long now = mstime();

    /* Note that we have to use a different temp name here compared to the
     * one used by rewriteAppendOnlyFileBackground() function. */
    snprintf(tmpfile,256,"temp-rewriteaof-%d.aof", (int) getpid());
    fp = fopen(tmpfile,"w");
    if (!fp) {
        redisLog(REDIS_WARNING, "Opening the temp file for AOF rewrite in rewriteAppendOnlyFile(): %s", strerror(errno));
        return REDIS_ERR;
    }

    rioInitWithFile(&aof,fp);
    if (server.aof_rewrite_incremental_fsync)
        rioSetAutoSync(&aof,REDIS_AOF_AUTOSYNC_BYTES);
    for (j = 0; j < server.dbnum; j++) {
        char selectcmd[] = "*2\r\n$6\r\nSELECT\r\n";
        redisDb *db = server.db+j;
        dict *d = db->dict;
        if (dictSize(d) == 0) continue;
        di = dictGetSafeIterator(d);
        if (!di) {
            fclose(fp);
            return REDIS_ERR;
        }

        /* SELECT the new DB */
        if (rioWrite(&aof,selectcmd,sizeof(selectcmd)-1) == 0) goto werr;
        if (rioWriteBulkLongLong(&aof,j) == 0) goto werr;

        /* Iterate this DB writing every entry (one command per key). */
        while ((de = dictNext(di)) != NULL) {
            sds keystr;
            robj key, *o;
            long long expiretime;

            keystr = dictGetKey(de);
            o = dictGetVal(de);
            initStaticStringObject(key,keystr);

            expiretime = getExpire(db,&key);

            /* If this key is already expired skip it */
            if (expiretime != -1 && expiretime < now) continue;

            /* Save the key and associated value */
            if (o->type == REDIS_STRING) {
                /* Emit a SET command */
                char cmd[] = "*3\r\n$3\r\nSET\r\n";
                if (rioWrite(&aof,cmd,sizeof(cmd)-1) == 0) goto werr;
                /* Key and value */
                if (rioWriteBulkObject(&aof,&key) == 0) goto werr;
                if (rioWriteBulkObject(&aof,o) == 0) goto werr;
            } else if (o->type == REDIS_LIST) {
                if (rewriteListObject(&aof,&key,o) == 0) goto werr;
            } else if (o->type == REDIS_SET) {
                if (rewriteSetObject(&aof,&key,o) == 0) goto werr;
            } else if (o->type == REDIS_ZSET) {
                if (rewriteSortedSetObject(&aof,&key,o) == 0) goto werr;
            } else if (o->type == REDIS_HASH) {
                if (rewriteHashObject(&aof,&key,o) == 0) goto werr;
            } else {
                redisPanic("Unknown object type");
            }
            /* Save the expire time */
            if (expiretime != -1) {
                char cmd[] = "*3\r\n$9\r\nPEXPIREAT\r\n";
                if (rioWrite(&aof,cmd,sizeof(cmd)-1) == 0) goto werr;
                if (rioWriteBulkObject(&aof,&key) == 0) goto werr;
                if (rioWriteBulkLongLong(&aof,expiretime) == 0) goto werr;
            }
        }
        dictReleaseIterator(di);
    }

    /* Make sure data will not remain on the OS's output buffers */
    if (fflush(fp) == EOF) goto werr;
    if (fsync(fileno(fp)) == -1) goto werr;
    if (fclose(fp) == EOF) goto werr;

    /* Use RENAME to make sure the DB file is changed atomically only
     * if the generated DB file is ok. */
    if (rename(tmpfile,filename) == -1) {
        redisLog(REDIS_WARNING,"Error moving temp append only file on the final destination: %s", strerror(errno));
        unlink(tmpfile);
        return REDIS_ERR;
    }
    redisLog(REDIS_NOTICE,"SYNC append only file rewrite performed");
    return REDIS_OK;

werr:
    fclose(fp);
    unlink(tmpfile);
    redisLog(REDIS_WARNING,"Write error writing append only file on disk: %s", strerror(errno));
    if (di) dictReleaseIterator(di);
    return REDIS_ERR;
}
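To make the output format concrete: if database 0 contained a single (hypothetical) string key foo with value bar, the rewritten file would hold a SELECT command followed by a SET command in the RESP protocol emitted above. Every protocol token ends with \r\n, so viewed as text the file reads:

*2
$6
SELECT
$1
0
*3
$3
SET
$3
foo
$3
bar

If the key also had a TTL, a PEXPIREAT command carrying the absolute expire time in milliseconds would follow these lines.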

The system also provides a background version of this rewrite operation:






/* This is how rewriting of the append only file in background works:
 *
 * 1) The user calls BGREWRITEAOF.
 * 2) Redis calls this function, which forks():
 *    2a) the child rewrites the append only file in a temp file.
 *    2b) the parent accumulates differences in server.aof_rewrite_buf.
 * 3) When the child has finished '2a', it exits.
 * 4) The parent will trap the exit code; if it's OK, it will append the
 *    data accumulated in server.aof_rewrite_buf to the temp file, and
 *    finally rename(2) the temp file to the actual file name.
 *    Then the new file is reopened as the new append only file. Profit! */
/* Rewrite the AOF data file in the background */
int rewriteAppendOnlyFileBackground(void)

The principle is the same as in yesterday's analysis: fork() creates a child process (not a thread) that performs the rewrite while the parent keeps serving clients. A simplified sketch of that flow is shown below, followed by the full list of functions exposed by aof.c:
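What follows is only a hedged sketch of the fork pattern described in the comment above, not the verbatim body of rewriteAppendOnlyFileBackground(); process-title changes, statistics bookkeeping and most error handling are omitted.

/* Simplified sketch of the background rewrite flow (details omitted). */
int rewriteAppendOnlyFileBackground(void) {
    pid_t childpid;

    /* Refuse to start if an AOF rewrite child is already running. */
    if (server.aof_child_pid != -1) return REDIS_ERR;

    if ((childpid = fork()) == 0) {
        /* Child: rewrite the dataset into a temp file, then exit with a
         * status code the parent can inspect. */
        char tmpfile[256];
        snprintf(tmpfile,256,"temp-rewriteaof-bg-%d.aof", (int) getpid());
        if (rewriteAppendOnlyFile(tmpfile) == REDIS_OK)
            exitFromChild(0);
        else
            exitFromChild(1);
    } else {
        /* Parent: on fork() failure give up, otherwise remember the child
         * pid. New writes keep being appended to
         * server.aof_rewrite_buf_blocks until the child exits, when
         * backgroundRewriteDoneHandler() merges them into the new file. */
        if (childpid == -1) {
            redisLog(REDIS_WARNING,
                "Can't rewrite append only file in background: fork: %s",
                strerror(errno));
            return REDIS_ERR;
        }
        server.aof_child_pid = childpid;
    }
    return REDIS_OK; /* The child never reaches this point. */
}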






/* API in aof.c */
void aofRewriteBufferReset(void) /* Release the old rewrite buffer list in the server and create a new one */
unsigned long aofRewriteBufferSize(void) /* Return the total size of the current AOF rewrite buffer */
void aofRewriteBufferAppend(unsigned char *s, unsigned long len) /* Append data to the buffer, allocating a new buffer block when the current one runs out of space */
ssize_t aofRewriteBufferWrite(int fd) /* Write the in-memory buffer contents to the file, block by block */
void aof_background_fsync(int fd) /* Start a background-thread fsync of the given file descriptor */
void stopAppendOnly(void) /* Stop the append-only operation (command/config driven) */
int startAppendOnly(void) /* Turn append-only mode on */
void flushAppendOnlyFile(int force) /* Flush the contents of the AOF buffer to disk */
sds catAppendOnlyGenericCommand(sds dst, int argc, robj **argv) /* Encode a command and its arguments into AOF (RESP) format and append the result to dst */
sds catAppendOnlyExpireAtCommand(sds buf, struct redisCommand *cmd, robj *key, robj *seconds) /* Convert all expire-style commands to PEXPIREAT, turning relative times into absolute ones (see the example after this list) */
void feedAppendOnlyFile(struct redisCommand *cmd, int dictid, robj **argv, int argc) /* Translate each executed command into its AOF form, depending on the command */
struct redisClient *createFakeClient(void) /* Commands in the AOF are replayed as if issued by a client, so a fake client is created for loading */
void freeFakeClientArgv(struct redisClient *c) /* Free the fake client's argument vector */
void freeFakeClient(struct redisClient *c) /* Free the fake client itself */
int loadAppendOnlyFile(char *filename) /* Load and replay the contents of an AOF file */
int rioWriteBulkObject(rio *r, robj *obj) /* Write a bulk object, either as a long long or as a plain string object */
int rewriteListObject(rio *r, robj *key, robj *o) /* Write a list object, handling both the ZIPLIST and the LINKEDLIST encodings */
int rewriteSetObject(rio *r, robj *key, robj *o) /* Write a set object */
int rewriteSortedSetObject(rio *r, robj *key, robj *o) /* Write a sorted set object */
static int rioWriteHashIteratorCursor(rio *r, hashTypeIterator *hi, int what) /* Write the element currently pointed to by a hash iterator */
int rewriteHashObject(rio *r, robj *key, robj *o) /* Write a hash (dictionary) object */
int rewriteAppendOnlyFile(char *filename) /* Completely rewrite the contents of the database into the file, key by key */
int rewriteAppendOnlyFileBackground(void) /* Rewrite the AOF data file in the background */
void bgrewriteaofCommand(redisClient *c) /* Command handler for BGREWRITEAOF: rewrite the AOF file in the background */
void aofRemoveTempFile(pid_t childpid) /* Remove the temporary AOF file produced by the child process with pid childpid */
void aofUpdateCurrentSize(void) /* Update the recorded size of the current AOF file */
void backgroundRewriteDoneHandler(int exitcode, int bysignal) /* Callback invoked when the background rewrite child process finishes */
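To illustrate the PEXPIREAT conversion mentioned in the list above, here is a small sketch; the helper names are invented for illustration and are not part of aof.c. A relative EXPIRE in seconds is turned into the absolute millisecond timestamp that PEXPIREAT expects:

/* Illustration only: convert a relative expire (seconds from now) into the
 * absolute millisecond timestamp used by PEXPIREAT. The real conversion is
 * done by catAppendOnlyExpireAtCommand() in aof.c. */
#include <sys/time.h>

static long long mstime_now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return ((long long)tv.tv_sec) * 1000 + tv.tv_usec / 1000;
}

/* EXPIRE key 10  ->  PEXPIREAT key (mstime_now() + 10 * 1000) */
static long long expire_to_pexpireat_ms(long long relative_seconds) {
    return mstime_now() + relative_seconds * 1000;
}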




