"Redis Design and Implementation" [Part I] Data Structures and Objects: C Source Reading (II)



IV. Skip List

Keywords: random level height


A skip list supports node lookup in O(log N) on average and O(N) in the worst case, and nodes can also be processed in batches through sequential traversal.

In most cases a skip list is as efficient as a balanced tree, and because it is simpler to implement, many programs use a skip list in place of a balanced tree.

Redis uses the skip list as one of the underlying implementations of sorted set (ordered set) keys: when a sorted set contains a large number of elements, or when its members are long strings, Redis implements the key with a skip list.

Redis uses skip lists in only two places: to implement sorted set keys, and as an internal data structure in cluster nodes.


Data Structure Source

The Redis skip list is defined by two structures, redis.h/zskiplistNode and redis.h/zskiplist:

/*
 * Skip list node
 */
typedef struct zskiplistNode {

    // member object
    robj *obj;

    // score
    double score;

    // backward pointer
    struct zskiplistNode *backward;

    // levels
    struct zskiplistLevel {

        // forward pointer
        struct zskiplistNode *forward;

        // span
        unsigned int span;

    } level[];

} zskiplistNode;
The zskiplistNode structure contains the following attributes:

Level: the level array can contain multiple elements, and each level has two attributes, a forward pointer and a span. The forward pointer is used to reach nodes closer to the tail of the list, and the span records the distance between the current node and the node the forward pointer points to. When the program traverses from the head to the tail of the list, it follows the forward pointers of the levels. In general, the more levels a node has, the faster other nodes can be reached.

Every time a new skip list node is created, the program generates a random value between 1 and 32 as the size of the level array, following a power law (the larger the value, the less likely it is to be generated). This size is the "height" of the node.
The span of a level whose forward pointer is NULL is 0.
Backward pointer: points to the node immediately before the current node, and is used when the program traverses from the tail of the list to the head. Unlike forward pointers, which can skip several nodes at once, each node has only one backward pointer, so traversal can only step back one node at a time.

Score: a double-precision floating-point number. Nodes in the skip list are arranged in ascending order of score.

Member object (obj): a pointer to a string object that holds an SDS value.
In the same skip list, the member object of each node must be unique, but several nodes may have the same score. Nodes with the same score are ordered lexicographically by member object: smaller member objects come first (closer to the head), and larger ones come later (closer to the tail).
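The power-law choice of node height described above can be sketched as follows. This is only an illustration, not Redis's own code: the names demo_random_level, DEMO_MAXLEVEL, and DEMO_P are hypothetical, and the growth probability of 0.25 is an assumption (the text only gives the range 1 to 32).

```c
#include <stdlib.h>

#define DEMO_MAXLEVEL 32   /* upper bound on the height, taken from the text */
#define DEMO_P 0.25        /* assumption: chance of growing by one more level */

/* Return a random height in [1, DEMO_MAXLEVEL]. Each extra level is
 * DEMO_P times as likely as the one below it, so tall nodes are rare. */
int demo_random_level(void) {
    int level = 1;
    while (level < DEMO_MAXLEVEL &&
           rand() < DEMO_P * ((double)RAND_MAX + 1.0))
        level++;
    return level;
}
```

With these assumed parameters about three quarters of all nodes end up with a single level, which keeps the per-node memory overhead small while the few tall nodes act as express lanes.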
/*
 * Skip list
 */
typedef struct zskiplist {

    // header and tail nodes
    struct zskiplistNode *header, *tail;

    // number of nodes in the list
    unsigned long length;

    // level count of the node with the most levels in the list
    int level;

} zskiplist;
The zskiplist structure stores information about the skip list as a whole, such as the number of nodes and the pointers to the header and tail nodes:

header: points to the header node of the skip list
tail: points to the tail node of the skip list
level: records the level count of the node with the most levels in the skip list (the levels of the header node are not counted)
length: records the length of the skip list, that is, the number of nodes it currently contains (the header node is not counted)
The header node has the same structure as the other nodes: it also has a backward pointer, a score, and a member object, but these attributes of the header node are never used.
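To illustrate how forward pointers and spans work together, here is a minimal sketch that finds the rank of a score by starting at the top level and accumulating spans. It makes several simplifying assumptions: nodes carry only a score (no member object), the level array has a fixed size DEMO_LEVELS instead of being flexible, and all names (demo_node, demo_rank, demo_build) are hypothetical.

```c
#include <stddef.h>

#define DEMO_LEVELS 2  /* assumption: fixed height instead of a flexible array */

typedef struct demo_node {
    double score;
    struct demo_node *backward;
    struct {
        struct demo_node *forward;
        unsigned int span;
    } level[DEMO_LEVELS];
} demo_node;

/* Return the 1-based rank of the node holding `score`, or 0 if absent. */
unsigned long demo_rank(const demo_node *header, int levels, double score) {
    const demo_node *x = header;
    unsigned long rank = 0;
    for (int i = levels - 1; i >= 0; i--) {
        /* Move right while the next node on this level is still too small. */
        while (x->level[i].forward && x->level[i].forward->score < score) {
            rank += x->level[i].span;
            x = x->level[i].forward;
        }
    }
    const demo_node *next = x->level[0].forward;
    return (next && next->score == score) ? rank + x->level[0].span : 0;
}

/* Link n[0] (header) and n[1..4] with scores 1..4.
 * Level 0 links every node; level 1 links header -> n[2] -> n[4]. */
void demo_build(demo_node n[5]) {
    for (int i = 0; i < 5; i++) {
        n[i].score = (double)i;
        n[i].backward = (i > 1) ? &n[i - 1] : NULL;
        n[i].level[0].forward = (i < 4) ? &n[i + 1] : NULL;
        n[i].level[0].span = (i < 4) ? 1 : 0;   /* NULL forward => span 0 */
        n[i].level[1].forward = NULL;
        n[i].level[1].span = 0;
    }
    n[0].level[1].forward = &n[2]; n[0].level[1].span = 2;
    n[2].level[1].forward = &n[4]; n[2].level[1].span = 2;
}
```

Looking up score 4.0 in the list built by demo_build takes one hop on level 1 (span 2) and one hop on level 0 (span 1) before the final step to the target node, so the accumulated rank is 4 without visiting every node.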

V. Integer Set
Keywords: upgrade rules

An integer set (intset) is one of the underlying implementations of set keys. When a set contains only integer elements and the number of elements is small, Redis uses an integer set as the underlying implementation of the set key.

Data Structure Source
typedef struct intset {

    // Encoding
    uint32_t encoding;

    // number of elements in the collection
    uint32_t length;

    // array holding elements
    int8_t contents[];

} intset;
An integer set (intset) is an abstract data structure that Redis uses to store integer values. It can hold values of type int16_t, int32_t, or int64_t, and it guarantees that the set contains no duplicate elements.

The contents array is the underlying storage of the integer set: each element of the set is an item of the contents array, items are sorted in ascending order by value, and the array contains no duplicates.

The length attribute records the number of elements in the integer set, that is, the length of the contents array.

encoding attribute: although the intset structure declares contents as an array of int8_t, the array does not actually hold any int8_t values. The real type of the contents array depends on the value of the encoding attribute:

If encoding is INTSET_ENC_INT16, contents is an array of int16_t, and each item is an int16_t value (minimum -32768, maximum 32767)
If encoding is INTSET_ENC_INT32, contents is an array of int32_t, and each item is an int32_t value (minimum -2147483648, maximum 2147483647)
If encoding is INTSET_ENC_INT64, contents is an array of int64_t, and each item is an int64_t value (minimum -9223372036854775808, maximum 9223372036854775807)
Upgrade Strategy for Integer Sets
When a new element is added to an integer set and its type is wider than the type of all existing elements, the integer set must be upgraded before the new element can be added.

Upgrading an integer set and adding the new element takes three steps:

Expand the underlying array according to the type of the new element, and allocate space for the new element
Convert all existing elements of the underlying array to the same type as the new element, and place the converted elements in their correct positions, keeping the underlying array sorted throughout
Add the new element to the underlying array
Because every addition to an integer set may trigger an upgrade, and every upgrade converts the type of all elements already in the underlying array, adding a new element to an integer set has O(N) time complexity.

The new element that triggers an upgrade is always wider than every existing element, so its value is either greater than all existing elements or smaller than all of them:

If the new element is smaller than all existing elements, it is placed at the very beginning of the underlying array (index 0)
If the new element is larger than all existing elements, it is placed at the very end of the underlying array (index length-1)
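The three upgrade steps can be sketched in a few dozen lines. This is an illustrative reimplementation under stated assumptions, not the Redis source: the demo_intset type stores the element width in bytes (2, 4, or 8) in its encoding field instead of the INTSET_ENC_* constants, keeps contents in a separately allocated buffer, and all function names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: encoding is the element width in bytes (2, 4, or 8). */
typedef struct {
    uint32_t encoding;
    uint32_t length;
    int8_t *contents;
} demo_intset;

static uint32_t demo_width(int64_t v) {
    if (v < INT32_MIN || v > INT32_MAX) return 8;
    if (v < INT16_MIN || v > INT16_MAX) return 4;
    return 2;
}

/* Read element `pos`, interpreting the buffer with element width `w`. */
static int64_t demo_get_as(const demo_intset *s, uint32_t pos, uint32_t w) {
    const int8_t *p = s->contents + (size_t)pos * w;
    if (w == 8) { int64_t v; memcpy(&v, p, 8); return v; }
    if (w == 4) { int32_t v; memcpy(&v, p, 4); return v; }
    int16_t v; memcpy(&v, p, 2); return v;
}

/* Write element `pos` using the set's current width. */
static void demo_put(demo_intset *s, uint32_t pos, int64_t value) {
    int8_t *p = s->contents + (size_t)pos * s->encoding;
    if (s->encoding == 8) { memcpy(p, &value, 8); }
    else if (s->encoding == 4) { int32_t v = (int32_t)value; memcpy(p, &v, 4); }
    else { int16_t v = (int16_t)value; memcpy(p, &v, 2); }
}

int64_t demo_get(const demo_intset *s, uint32_t pos) {
    return demo_get_as(s, pos, s->encoding);
}

/* Build an int16-encoded set from an already sorted, duplicate-free array. */
demo_intset demo_from_int16(const int16_t *vals, uint32_t n) {
    demo_intset s = { 2, n, malloc((size_t)n * 2) };
    if (s.contents) memcpy(s.contents, vals, (size_t)n * 2);
    else s.length = 0;
    return s;
}

/* Upgrade the set so it can hold `value`, then add it.
 * Returns 0 on success, -1 if no upgrade is needed or allocation fails. */
int demo_upgrade_and_add(demo_intset *s, int64_t value) {
    uint32_t old = s->encoding, wide = demo_width(value);
    if (wide <= old) return -1;            /* ordinary insert: out of scope here */
    uint32_t prepend = value < 0 ? 1 : 0;  /* wider and negative => smallest */
    int8_t *buf = realloc(s->contents, (size_t)(s->length + 1) * wide);  /* step 1 */
    if (!buf) return -1;
    s->contents = buf;
    s->encoding = wide;
    /* Step 2: convert from back to front so that widened values never
     * overwrite narrow values that have not been read yet. */
    for (uint32_t i = s->length; i-- > 0; )
        demo_put(s, i + prepend, demo_get_as(s, i, old));
    demo_put(s, prepend ? 0 : s->length, value);                         /* step 3 */
    s->length++;
    return 0;
}
```

Starting from the int16_t set {1, 2, 3}, adding 65535 widens every element to four bytes and appends the new value at index 3; adding a large negative value afterwards widens to eight bytes and shifts every element one slot right so the new value lands at index 0.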
The upgrade strategy for integer sets has two benefits:

Flexibility. Integers of type int16_t, int32_t, or int64_t can be added to the set freely, without worrying about type errors.

Memory savings. The set can hold all three types of values, while the upgrade is performed only when it is actually needed.

Integer sets do not support downgrading. Once the array has been upgraded, the encoding stays in the upgraded state.

VI. Compressed List
Keywords: chain update

A compressed list (ziplist) is one of the underlying implementations of list keys and hash keys. When a list key contains only a few items, and each item is either a small integer or a short string, Redis uses a compressed list as the underlying implementation of the list key.

The compressed list was developed by Redis to save memory. It is a sequential data structure made up of a series of specially encoded contiguous memory blocks. A compressed list can contain any number of nodes (entries), and each node holds either a byte array or an integer value.

Data Structure Source
/*
Empty ziplist example diagram

area        |<---- ziplist header ---->|<-- end -->|

size          4 bytes   4 bytes 2 bytes  1 byte
            +---------+--------+-------+-----------+
component   | zlbytes | zltail | zllen | zlend     |
            |         |        |       |           |
value       |  1011   |  1010  |   0   | 1111 1111 |
            +---------+--------+-------+-----------+
                                       ^
                                       |
                               ZIPLIST_ENTRY_HEAD
                                       &
address                        ZIPLIST_ENTRY_TAIL
                                       &
                               ZIPLIST_ENTRY_END

Non-empty ziplist example diagram

area        |<---- ziplist header ---->|<----------- entries ------------->|<-end->|

size          4 bytes  4 bytes  2 bytes    ?        ?        ?        ?     1 byte
            +---------+--------+-------+--------+--------+--------+--------+-------+
component   | zlbytes | zltail | zllen | entry1 | entry2 |  ...   | entryN | zlend |
            +---------+--------+-------+--------+--------+--------+--------+-------+
                                       ^                          ^        ^
address                                |                          |        |
                                ZIPLIST_ENTRY_HEAD                |   ZIPLIST_ENTRY_END
                                                                  |
                                                        ZIPLIST_ENTRY_TAIL
*/
zlbytes attribute: uint32_t, 4 bytes. Records the number of bytes of memory occupied by the entire compressed list; used when reallocating the compressed list or computing the position of zlend
zltail attribute: uint32_t, 4 bytes. Records the offset in bytes of the tail node from the start address of the compressed list; with this offset, the address of the tail node can be determined without traversing the entire list
zllen attribute: uint16_t, 2 bytes. Records the number of nodes in the compressed list: when the value is less than UINT16_MAX (65535), it is the actual number of nodes; when it equals UINT16_MAX, the real number of nodes can only be obtained by traversing the entire list
entryX attributes: list nodes, variable size. The nodes contained in the compressed list; the length of each node is determined by the content it holds
zlend attribute: uint8_t, 1 byte. The special value 0xFF (255 decimal), which marks the end of the compressed list
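As a rough sketch of the header layout, the following code builds an empty compressed list in a plain byte buffer and reads the three header fields back. The function names are hypothetical, and storing the fields in little-endian byte order is an assumption of this sketch rather than something stated in the text.

```c
#include <stdint.h>
#include <stdlib.h>

#define DEMO_HEADER_SIZE 10   /* zlbytes (4) + zltail (4) + zllen (2) */
#define DEMO_ZLEND 0xFF

/* Store and load little-endian integers byte by byte (assumed byte order). */
static void demo_store_le(unsigned char *p, uint32_t v, int nbytes) {
    for (int i = 0; i < nbytes; i++) p[i] = (unsigned char)(v >> (8 * i));
}
static uint32_t demo_load_le(const unsigned char *p, int nbytes) {
    uint32_t v = 0;
    for (int i = 0; i < nbytes; i++) v |= (uint32_t)p[i] << (8 * i);
    return v;
}

/* Allocate an empty list: header followed directly by the end marker. */
unsigned char *demo_ziplist_new(void) {
    unsigned char *zl = malloc(DEMO_HEADER_SIZE + 1);
    if (!zl) return NULL;
    demo_store_le(zl, DEMO_HEADER_SIZE + 1, 4);   /* zlbytes = 11 */
    demo_store_le(zl + 4, DEMO_HEADER_SIZE, 4);   /* zltail  = 10 */
    demo_store_le(zl + 8, 0, 2);                  /* zllen   = 0  */
    zl[DEMO_HEADER_SIZE] = DEMO_ZLEND;
    return zl;
}

uint32_t demo_zlbytes(const unsigned char *zl) { return demo_load_le(zl, 4); }
uint32_t demo_zltail(const unsigned char *zl)  { return demo_load_le(zl + 4, 4); }
uint32_t demo_zllen(const unsigned char *zl)   { return demo_load_le(zl + 8, 2); }
```

The values 11 and 10 are the binary 1011 and 1010 shown in the empty ziplist diagram: the whole list occupies 11 bytes, and the position where the first entry would start is 10 bytes from the beginning.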
/*
 * Structure for holding ziplist node information
 */
typedef struct zlentry {

    // prevrawlen: the length of the previous node
    // prevrawlensize: the size in bytes needed to encode prevrawlen
    unsigned int prevrawlensize, prevrawlen;

    // len: the length of the current node value
    // lensize: the size in bytes needed to encode len
    unsigned int lensize, len;

    // the size of the current node header
    // equals prevrawlensize + lensize
    unsigned int headersize;

    // the encoding type used for the current node value
    unsigned char encoding;

    // pointer to the current node
    unsigned char *p;

} zlentry;
Each compressed list node can hold a byte array or an integer value. The byte array can have one of three lengths:

Byte array of length less than or equal to 63 (2^6-1) bytes
Byte array of length less than or equal to 16383 (2^14-1) bytes
Byte array of length less than or equal to 4294967295 (2^32-1) bytes
The integer value can be one of the following six kinds:

4-bit unsigned integer between 0 and 12
1-byte signed integer
3-byte signed integer
int16_t integer
int32_t integer
int64_t integer
Each compressed list node consists of three parts: previous_entry_length, encoding, and content.

The previous_entry_length attribute of a node records, in bytes, the length of the previous node in the compressed list. The attribute itself is either 1 byte or 5 bytes long:

If the previous node is shorter than 254 bytes, previous_entry_length is 1 byte long, and the length of the previous node is stored in that byte

If the previous node is 254 bytes or longer, previous_entry_length is 5 bytes long: the first byte is set to 0xFE (254 decimal), and the following four bytes store the length of the previous node

Because previous_entry_length records the length of the previous node, the program can compute the start address of the previous node from the start address of the current node with pointer arithmetic.

Traversal of a compressed list from tail to head relies on this principle. Given a pointer to the start address of a node, the program can use that pointer and the node's previous_entry_length to step back one node at a time until it reaches the head node of the compressed list.
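The 1-byte and 5-byte forms of previous_entry_length, and the pointer arithmetic used for backward traversal, might look like the sketch below. The function names are hypothetical, and the little-endian layout of the four length bytes is an assumption of this example.

```c
#include <stdint.h>

/* Write the previous node's length at p; return the bytes used (1 or 5).
 * Assumption: the 4-byte length is stored little-endian. */
unsigned int demo_prevlen_encode(unsigned char *p, uint32_t len) {
    if (len < 254) { p[0] = (unsigned char)len; return 1; }
    p[0] = 0xFE;                                   /* marker for the 5-byte form */
    for (int i = 0; i < 4; i++) p[1 + i] = (unsigned char)(len >> (8 * i));
    return 5;
}

/* Read the previous node's length from p; return the bytes consumed. */
unsigned int demo_prevlen_decode(const unsigned char *p, uint32_t *len) {
    if (p[0] < 0xFE) { *len = p[0]; return 1; }
    *len = 0;
    for (int i = 0; i < 4; i++) *len |= (uint32_t)p[1 + i] << (8 * i);
    return 5;
}

/* Step back from the start of the current node to the start of the previous one. */
const unsigned char *demo_prev_entry(const unsigned char *cur) {
    uint32_t len;
    demo_prevlen_decode(cur, &len);
    return cur - len;
}
```

Calling demo_prev_entry repeatedly, starting from the tail node, walks the list toward the head without any per-node back pointer.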

The encoding attribute records the type and length of the data held in the node's content attribute:

A one-byte, two-byte, or five-byte encoding whose two highest bits are 00, 01, or 10 is a byte array encoding: the node's content attribute holds a byte array, and the length of the array is given by the remaining bits of the encoding after the two highest bits are removed
A one-byte encoding whose two highest bits are 11 is an integer encoding: the node's content attribute holds an integer value, and the type and length of the integer are given by the remaining bits of the encoding after the two highest bits are removed
The content attribute holds the value of the node, which can be a byte array or an integer. The type and length of the value are determined by the node's encoding attribute.
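A small decoder makes the role of the two highest bits concrete. This is a sketch with hypothetical function names; in particular, reading the multi-byte lengths in big-endian order is an assumption of this example, and the integer sub-encodings are not decoded.

```c
#include <stdint.h>

/* Returns 1 if the encoding byte announces a byte array, 0 for an integer. */
int demo_is_str(unsigned char enc) {
    return (enc & 0xC0) != 0xC0;
}

/* Decode a byte-array encoding starting at p. Stores the array length in
 * *len and returns the size of the encoding itself (1, 2, or 5 bytes).
 * Returns 0 for integer encodings.
 * Assumption: multi-byte lengths are stored big-endian. */
unsigned int demo_str_len(const unsigned char *p, uint32_t *len) {
    switch (p[0] & 0xC0) {
    case 0x00:                                   /* 00bbbbbb: up to 63 bytes */
        *len = p[0] & 0x3F;
        return 1;
    case 0x40:                                   /* 01bbbbbb + 1 byte: up to 16383 */
        *len = ((uint32_t)(p[0] & 0x3F) << 8) | p[1];
        return 2;
    case 0x80:                                   /* 10______ + 4 bytes */
        *len = ((uint32_t)p[1] << 24) | ((uint32_t)p[2] << 16) |
               ((uint32_t)p[3] << 8) | p[4];
        return 5;
    default:                                     /* 11______: integer encoding */
        return 0;
    }
}
```

The 6 and 14 usable bits in the first two cases are where the limits of 63 and 16383 bytes in the list of byte array lengths come from.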

Chain Update
Adding a node to or deleting a node from a compressed list may trigger a chain update: when the node in front of an entry becomes 254 bytes or longer, that entry's previous_entry_length must grow from 1 byte to 5 bytes, which may push the entry itself past the 254-byte threshold and force the same change on the entry after it, and so on down the list.

In the worst case a chain update performs N space reallocations on the compressed list, and each reallocation has a worst-case complexity of O(N), so the worst-case complexity of a chain update is O(N^2).

Although the complexity of a chain update is high, it is unlikely to cause real performance problems:

First, a chain update can only be triggered when the compressed list happens to contain several consecutive nodes whose lengths are all between 250 and 253 bytes
Second, even when a chain update does occur, it has no noticeable effect on performance as long as the number of nodes being updated is small
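The cascade can be simulated on node lengths alone. The sketch below is a hypothetical model, not Redis code: it assumes every existing node currently uses a 1-byte previous_entry_length, ignores the actual memory moves, and only counts how many nodes must grow when a new node is inserted at the head.

```c
#include <stddef.h>

/* sizes[i] is the total byte length of node i; every node is assumed to
 * start with a 1-byte previous_entry_length. A new node of new_len bytes
 * is inserted before node 0. Updates sizes in place and returns the
 * number of nodes whose header had to grow. */
size_t demo_chain_update(size_t *sizes, size_t n, size_t new_len) {
    size_t prev_len = new_len, grown = 0;
    for (size_t i = 0; i < n; i++) {
        if (prev_len < 254) break;   /* 1 byte still suffices: the chain stops */
        sizes[i] += 4;               /* header grows from 1 byte to 5 bytes */
        grown++;
        prev_len = sizes[i];         /* may now exceed 253 and affect node i+1 */
    }
    return grown;
}
```

With three nodes of 253 bytes each, inserting a 300-byte node at the head forces all three to grow. If the second node is only 100 bytes long, it still grows (its predecessor is now 257 bytes) but ends at 104 bytes, so the chain stops there.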
VII. Objects
Keywords: encoding conversion, polymorphic commands, memory reclamation and sharing, LRU

Redis builds an object system on top of the data structures above. The system includes five types of objects: string objects, list objects, hash objects, set objects, and sorted set objects. Each object uses at least one of those data structures.

Benefits of using objects:

Before executing a command, Redis can use the type of an object to decide whether the object can execute that command
An object can be given several different data structure implementations for different usage scenarios, which optimizes its efficiency in each scenario
The Redis object system implements memory reclamation based on reference counting: when a program no longer uses an object, the memory it occupies is released automatically
Redis also implements object sharing through reference counting, saving memory by letting multiple database keys share the same object
Redis objects record their access time. This information is used to compute the idle time of database keys; when the server's maxmemory feature is enabled, keys with longer idle times may be deleted first
Data Structure Source
Redis uses objects to represent the keys and values in the database. Whenever a new key-value pair is created in the database, at least two objects are created: the key object, used as the key of the pair, and the value object, used as its value.

typedef struct redisObject {

    // type
    unsigned type:4;

    // encoding
    unsigned encoding:4;

    // The last time the object was accessed, used to calculate the idle time of the object
    // When the memory used by the server exceeds the limit set by the maxmemory option,
    // keys with longer idle times are released first to reclaim memory
    unsigned lru:REDIS_LRU_BITS; /* lru time (relative to server.lruclock) */

    // reference count
    int refcount;

    // pointer to the actual value
    void *ptr;

} robj;
Every object in Redis is represented by a redisObject structure. Three attributes of this structure relate to the stored data: type, encoding, and ptr.

The type attribute records the type of the object. Its value is one of the following constants: REDIS_STRING (string object), REDIS_LIST (list object), REDIS_HASH (hash object), REDIS_SET (set object), REDIS_ZSET (sorted set object).
For the key-value pairs stored in a Redis database, the key is always a string object, while the value can be a string object, a list object, a hash object, a set object, or a sorted set object.

The TYPE command follows the same convention: when TYPE is executed on a database key, the result it returns is the type of the value object of that key, not the type of the key object.

The encoding attribute records the encoding used by the object, that is, which data structure serves as the object's underlying implementation.

Because the encoding is set per object through the encoding attribute, Redis can choose different encodings for an object in different usage scenarios, optimizing the object's efficiency in each scenario.

String object encoding conversion
The encoding of a string object can be int, raw, or embstr.

If a string object holds an integer value that can be represented as a long, the string object stores the integer in the ptr attribute of the object structure (casting void * to long) and sets its encoding to int.

If the string object holds a string value whose length is less than or equal to 32 bytes, the string object uses embstr encoding to store the value. Longer string values are stored with raw encoding.

Floating-point numbers that can be represented by the long double type are also stored as string values in Redis.

For an int-encoded string object, if a command changes the object so that it no longer holds an integer but a string value, the encoding of the string object changes from int to raw.

An embstr-encoded string object is effectively read-only. When any modification command is executed on an embstr-encoded string object, the program first converts the object's encoding from embstr to raw and then executes the command. As a result, an embstr-encoded string object always becomes a raw-encoded string object after it is modified.
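The choice between the three string encodings can be sketched as a single decision function. This is an illustration only: demo_string_encoding and DEMO_EMBSTR_LIMIT are hypothetical names, the 32-byte threshold is the one quoted in the text above, and the integer test based on strtol is a simplification of whatever check Redis actually performs.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_EMBSTR_LIMIT 32   /* threshold quoted in the text above */

/* Returns "int", "embstr", or "raw" for a NUL-terminated string value. */
const char *demo_string_encoding(const char *s) {
    size_t len = strlen(s);
    /* Simplified integer test: optional '-', then digits, fitting in a long. */
    if (len > 0 && (s[0] == '-' || (s[0] >= '0' && s[0] <= '9'))) {
        char *end;
        errno = 0;
        (void)strtol(s, &end, 10);
        if (errno == 0 && end != s && *end == '\0') return "int";
    }
    /* Everything else, floating-point text included, is kept as a string. */
    return len <= DEMO_EMBSTR_LIMIT ? "embstr" : "raw";
}
```

Under these assumptions "10086" is classified as int, "3.14" as embstr (floating-point values are stored as strings), and a 40-digit number as raw, because it neither fits in a long nor stays within the embstr limit.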

List object encoding conversion
The encoding of a list object can be ziplist or linkedlist.

A ziplist-encoded list object uses a compressed list as its underlying implementation. Each compressed list node (entry) holds one list element.

A linkedlist-encoded list object uses a double-ended linked list as its underlying implementation. Each linked list node holds a string object, and each string object holds one list element.

A list object uses ziplist encoding when it meets both of the following conditions:

All string elements held by the list object are shorter than 64 bytes
The list object holds fewer than 512 elements
Otherwise it uses linkedlist encoding.
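The two conditions translate directly into a predicate over the element lengths. The sketch below uses hypothetical names and hard-codes the thresholds quoted above; it does not read any Redis configuration.

```c
#include <stddef.h>

#define DEMO_LIST_MAX_VALUE   64    /* elements must be shorter than this */
#define DEMO_LIST_MAX_ENTRIES 512   /* the list must hold fewer than this */

/* lens[i] is the byte length of element i. Returns 1 if a list with these
 * elements may stay ziplist-encoded, 0 if it must use linkedlist. */
int demo_list_fits_ziplist(const size_t *lens, size_t n) {
    if (n >= DEMO_LIST_MAX_ENTRIES) return 0;
    for (size_t i = 0; i < n; i++)
        if (lens[i] >= DEMO_LIST_MAX_VALUE) return 0;
    return 1;
}
```

A single 64-byte element is enough to fail the check, as is a 512th element. The hash and sorted set conditions have the same shape with their own thresholds.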

Hash object encoding conversion
The encoding of a hash object can be ziplist or hashtable.

A ziplist-encoded hash object uses a compressed list as its underlying implementation. Whenever a new key-value pair is added to the hash object, the program first pushes the compressed list node holding the key onto the tail of the compressed list, and then pushes the node holding the value onto the tail:

The two nodes holding the same key-value pair are always adjacent, with the key node first and the value node after it
Key-value pairs added earlier are placed toward the head of the compressed list, and pairs added later are placed toward the tail
A hashtable-encoded hash object uses a dictionary as its underlying implementation, and each key-value pair of the hash object is stored as a dictionary key-value pair:

Each key of the dictionary is a string object that holds the key of the pair
Each value of the dictionary is a string object that holds the value of the pair
A hash object uses ziplist encoding when it meets both of the following conditions:

The key and value strings of all key-value pairs held by the hash object are shorter than 64 bytes
The hash object holds fewer than 512 key-value pairs
Otherwise it uses hashtable encoding.

Set object encoding conversion
The encoding of a set object can be intset or hashtable.

An intset-encoded set object uses an integer set as its underlying implementation. All elements of the set object are stored in the integer set.

A hashtable-encoded set object uses a dictionary as its underlying implementation. Each key of the dictionary is a string object that holds one set element, and the dictionary values are all set to NULL.

A set object uses intset encoding when it meets both of the following conditions:

All elements held by the set object are integer values
The set object holds no more than 512 elements
Otherwise it uses hashtable encoding.

Sorted set object encoding conversion
The encoding of a sorted set object can be ziplist or skiplist.

A ziplist-encoded sorted set object uses a compressed list as its underlying implementation. Each set element is stored in two adjacent compressed list nodes: the first node holds the member of the element, and the second node holds the score of the element.

The elements in the compressed list are sorted by score in ascending order: elements with smaller scores are placed closer to the head, and elements with larger scores closer to the tail.

A skiplist-encoded sorted set object uses the zset structure as its underlying implementation. A zset structure contains both a dictionary and a skip list:

/*
 * Sorted set
 */
typedef struct zset {

    // dictionary: key is the member, value is the score
    // supports looking up a score by member in O(1)
    dict *dict;

    // skip list: sorts members by score
    // supports locating members by score in O(log N) on average,
    // as well as range operations
    zskiplist *zsl;

} zset;
The member of each element of the sorted set is a string object, and the score of each element is a double-precision floating-point number.

Although the zset structure uses both a skip list and a dictionary to store the sorted set elements, the two data structures share the member and score of each element through pointers, so storing the elements in both does not duplicate any member or score and wastes no extra memory.

A sorted set object uses ziplist encoding when it meets both of the following conditions:

The sorted set holds fewer than 128 elements
All element members held by the sorted set are shorter than 64 bytes
Otherwise the sorted set object uses skiplist encoding.

Type checking and command polymorphism
The commands used to operate on keys in Redis fall into two groups:

Commands that can be executed on keys of any type, such as DEL, EXPIRE, RENAME, TYPE, and OBJECT
Commands that can only be executed on keys of a specific type
Before executing a type-specific command, Redis checks whether the input key has the correct type and then decides whether to execute the command.

Type checking for type-specific commands is implemented through the type attribute of the redisObject structure:

Before executing a type-specific command, the server checks whether the value object of the input database key has the type the command requires; if it does, the server executes the command
Otherwise the server refuses to execute the command and returns a type error to the client
In addition to the type check, Redis selects the correct implementation of the command according to the encoding of the value object.
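The two stages, a type check followed by dispatch on the encoding, can be sketched with a toy length command. Everything here is hypothetical: the demo_obj type is a cut-down stand-in for redisObject, the DEMO_* constants are not the Redis constants, and the two per-encoding helpers simply read a stored length so that the example stays self-contained.

```c
#include <stddef.h>

enum { DEMO_STRING, DEMO_LIST, DEMO_HASH, DEMO_SET, DEMO_ZSET };   /* type */
enum { DEMO_ENC_ZIPLIST, DEMO_ENC_LINKEDLIST };                    /* encoding (subset) */

typedef struct {
    unsigned type:4;
    unsigned encoding:4;
    void *ptr;
} demo_obj;

/* Stand-ins for the per-encoding implementations: in this sketch ptr
 * points at a long holding the length. */
static long demo_ziplist_len(const void *p)    { return *(const long *)p; }
static long demo_linkedlist_len(const void *p) { return *(const long *)p; }

/* LLEN-like command: returns the length, or -1 as a stand-in for the
 * type error that would be sent to the client. */
long demo_llen(const demo_obj *o) {
    if (o->type != DEMO_LIST) return -1;          /* type check */
    switch (o->encoding) {                        /* polymorphic dispatch */
    case DEMO_ENC_ZIPLIST:    return demo_ziplist_len(o->ptr);
    case DEMO_ENC_LINKEDLIST: return demo_linkedlist_len(o->ptr);
    default:                  return -1;
    }
}
```

The caller never needs to know which encoding a list uses: the same command works on both, while a string object is rejected before any list code runs.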

Memory reclamation and object sharing
Redis implements memory reclamation through reference counting.

An object's reference count changes as the object is used:

When a new object is created, its reference count is initialized to 1
When the object is used by a new program, its reference count is incremented by 1
When the object is no longer used by a program, its reference count is decremented by 1
When the object's reference count reaches 0, the memory occupied by the object is released
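The reference count life cycle can be sketched with three small functions. The type and function names are hypothetical, and the sketch frees only the object header: releasing whatever ptr refers to would depend on the object's type and is left out.

```c
#include <stdlib.h>

typedef struct {
    int refcount;
    void *ptr;      /* payload; not owned or freed by this sketch */
} demo_rc_obj;

/* A newly created object starts with a reference count of 1. */
demo_rc_obj *demo_rc_create(void *ptr) {
    demo_rc_obj *o = malloc(sizeof(*o));
    if (!o) return NULL;
    o->refcount = 1;
    o->ptr = ptr;
    return o;
}

/* A new user takes a reference. */
void demo_rc_incr(demo_rc_obj *o) {
    o->refcount++;
}

/* A user drops its reference. Returns 1 if the object was released. */
int demo_rc_decr(demo_rc_obj *o) {
    if (--o->refcount > 0) return 0;
    free(o);
    return 1;
}
```

Sharing an object between two keys then amounts to calling demo_rc_incr once more instead of creating a second copy; the object is released only after both users have called demo_rc_decr.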
The object sharing mechanism built on reference counting makes Redis more memory efficient.

Shared objects can be used not only by string keys, but also by objects that nest string objects in their data structures (linkedlist-encoded list objects, hashtable-encoded hash objects, hashtable-encoded set objects, and zset-encoded sorted set objects).

Redis only shares string objects containing integer values.