Redis source code analysis (26) --- slowlog and hyperloglog

Last Update:2014-11-02 Source: Internet

Author: User

Today I read two files whose names both contain "log". What they implement turned out to be quite different from what I first assumed. At first I thought they simply recorded different kinds of logs, and only gradually understood their real purpose. slowlog records queries whose execution time exceeds a configured threshold, while hyperloglog has nothing to do with logging at all (the name fooled me again). It is a cardinality estimation algorithm, that is, an algorithm for counting distinct elements, and the name should be read as hyper + loglog. With that settled, let's look at how the Redis code implements them.

Official explanation of slowlog:

/* Slowlog implements a system that is able to remember the latest N
 * queries that took more than M microseconds to execute.
 *
 * The execution time to reach to be logged in the slow log is set
 * using the 'slowlog-log-slower-than' config directive, that is also
 * readable and writable using the CONFIG SET/GET command.
 *
 * The slow queries log is actually not "logged" in the Redis log file
 * but is accessible thanks to the SLOWLOG command.
 */

In short, slowlog records the last N queries that took longer than a configured amount of time, that is, the relatively time-consuming queries.

It defines a structure for a slow log entry:

/* This structure defines an entry inside the slow log list */
/* the slow log structure; entries of this type are inserted into the slowlog list */
typedef struct slowlogEntry {
    robj **argv;
    int argc;
    // the entry's own id
    long long id;       /* Unique entry identifier. */
    // time consumed by the query (the upstream comment says nanoseconds, but the
    // value is compared against slowlog-log-slower-than, which is in microseconds)
    long long duration; /* Time spent by the query, in nanoseconds. */
    // time at which the query happened
    time_t time;        /* Unix time at which the query was executed. */
} slowlogEntry;

/* Exported API */
void slowlogInit(void); /* slowlog initialization */
void slowlogPushEntryIfNeeded(robj **argv, int argc, long long duration); /* push a slowlogEntry onto the list */

/* Exported commands */
// the command exposed to clients
void slowlogCommand(redisClient *c);

The functions defined there are just as simple: an initialization function and an insertion function. The server maintains a slowlog list and pushes each slow query record (a slowlogEntry) onto the head of that list, so the newest entry always comes first:

/* Initialize the slow log. This function should be called a single time
 * at server startup. */
/* slowlog initialization */
void slowlogInit(void) {
    // create the slowlog list
    server.slowlog = listCreate();
    // the first entry id starts at 0
    server.slowlog_entry_id = 0;
    listSetFreeMethod(server.slowlog, slowlogFreeEntry);
}

Inserting into the list:

/* Push a new entry into the slow log.
 * This function will make sure to trim the slow log accordingly to the
 * configured max length. */
/* insert an entry into the slowlog list if the query took longer than the configured threshold */
void slowlogPushEntryIfNeeded(robj **argv, int argc, long long duration) {
    if (server.slowlog_log_slower_than < 0) return; /* Slowlog disabled */
    if (duration >= server.slowlog_log_slower_than)
        // the entry's duration reaches the slowlog_log_slower_than threshold, so add it
        listAddNodeHead(server.slowlog, slowlogCreateEntry(argv, argc, duration));

    /* Remove old entries if needed. */
    while (listLength(server.slowlog) > server.slowlog_max_len)
        // the list is longer than the configured maximum, so remove the last (oldest) slowlogEntry
        listDelNode(server.slowlog, listLast(server.slowlog));
}
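The push-and-trim behaviour described above can be tried outside Redis with a small stand-alone sketch. Everything here is a simplified assumption for illustration: the `Entry` and `SlowLog` types and the `slowlog_*` functions are hypothetical stand-ins, not the Redis list API, and each entry stores only an id and a duration instead of the command arguments.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a slow log entry. */
typedef struct Entry {
    long long id;        /* unique, increasing identifier */
    long long duration;  /* execution time in microseconds */
    struct Entry *prev, *next;
} Entry;

/* Hypothetical stand-in for the server-side state. */
typedef struct SlowLog {
    Entry *head, *tail;      /* head = newest entry, tail = oldest */
    size_t len, max_len;
    long long slower_than;   /* threshold in microseconds; < 0 disables */
    long long next_id;
} SlowLog;

void slowlog_init(SlowLog *sl, long long slower_than, size_t max_len) {
    sl->head = sl->tail = NULL;
    sl->len = 0;
    sl->max_len = max_len;
    sl->slower_than = slower_than;
    sl->next_id = 0;
}

/* Record the query if it was slow enough, then trim the oldest entries. */
void slowlog_push_if_needed(SlowLog *sl, long long duration) {
    if (sl->slower_than < 0) return;            /* logging disabled */
    if (duration >= sl->slower_than) {
        Entry *e = malloc(sizeof(*e));
        assert(e != NULL);
        e->id = sl->next_id++;
        e->duration = duration;
        e->prev = NULL;
        e->next = sl->head;                     /* insert at the head */
        if (sl->head) sl->head->prev = e; else sl->tail = e;
        sl->head = e;
        sl->len++;
    }
    while (sl->len > sl->max_len) {             /* drop from the tail */
        Entry *old = sl->tail;
        sl->tail = old->prev;
        if (sl->tail) sl->tail->next = NULL; else sl->head = NULL;
        free(old);
        sl->len--;
    }
}

void slowlog_free(SlowLog *sl) {
    while (sl->head) {
        Entry *next = sl->head->next;
        free(sl->head);
        sl->head = next;
    }
    sl->tail = NULL;
    sl->len = 0;
}
```

With a threshold of 10000 microseconds and a maximum length of 2, a 5000 microsecond query is ignored, and after three slow queries only the two most recent remain, newest first.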

Slowlog is simple and clear, so the rest of this article focuses on hyperloglog, which is a cardinality estimation algorithm. Take counting the number of distinct words in a Shakespeare text as an example. The usual approach is to store every word in a hash set and then read off its size. With massive data sets, however, that consumes a great deal of memory.

This is where the "bitmap method" comes in: a bitmap can quickly and accurately obtain the cardinality of a given input. The basic idea is to map the data set onto a bit field using a hash function, with each input element mapping to its own bit. The mapping produces no collisions, and the space needed to count each distinct element drops to a single bit. Although bitmaps save a lot of storage compared with a hash set, they still run into trouble when the cardinality is very high or when there is a very large number of different data sets to count.

Fortunately, cardinality estimation is an active field and many of its algorithms have open-source implementations. The idea behind these algorithms is to trade accuracy for space: the result may be slightly less accurate, but the memory used drops dramatically. Three typical counting approaches can be found on the Internet: the Java HashSet, the linear probabilistic counter, and the hyper loglog counter. I will discuss the second and the third.
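As a minimal sketch of the bitmap idea, the code below counts distinct values exactly under one strong assumption: every value is an integer in a known range [0, nbits), so each value can own one bit and no hash function is needed. The `Bitmap` type and the `bitmap_*` functions are hypothetical names chosen for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One bit per possible value; assumes values lie in [0, nbits). */
typedef struct Bitmap {
    uint8_t *bits;
    size_t nbits;
    size_t distinct;   /* number of bits currently set */
} Bitmap;

/* Returns 1 on success, 0 if memory could not be allocated. */
int bitmap_init(Bitmap *b, size_t nbits) {
    assert(nbits > 0);
    b->bits = calloc((nbits + 7) / 8, 1);   /* all bits start at 0 */
    b->nbits = nbits;
    b->distinct = 0;
    return b->bits != NULL;
}

/* Mark a value as seen; only the first occurrence changes the count. */
void bitmap_add(Bitmap *b, size_t value) {
    assert(value < b->nbits);
    uint8_t mask = (uint8_t)(1u << (value % 8));
    if (!(b->bits[value / 8] & mask)) {
        b->bits[value / 8] |= mask;
        b->distinct++;
    }
}

void bitmap_free(Bitmap *b) {
    free(b->bits);
    b->bits = NULL;
}
```

The count is exact, but the memory cost is one bit per possible value, which is the limitation that motivates the approximate counters.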

The linear probabilistic counter is space-efficient and lets the implementer specify the desired level of accuracy. It is useful when space efficiency matters but you still need to control the error in the result. The algorithm runs in two steps. In step 1, it allocates a bit-map in memory initialized to all zeros, applies a hash function to each entry in the input data, maps the hash of each record (or element) to one bit of the bit-map, and sets that bit to 1. In step 2, it counts the number of empty (zero) bits and feeds that number into the following formula to obtain the estimate:

$$n = -m \ln V_n$$

Note: ln is the natural logarithm, so ln V_n = log_e(V_n).

In the formula, m is the size of the bit-map and V_n is the ratio of empty bits to the size of the map. Note that the bit-map can be much smaller than the expected maximum cardinality. How much smaller depends on how much error you can tolerate. Because the size of the bit-map, m, is smaller than the total number of distinct elements, collisions will occur. The collisions are what saves space, but they also introduce error into the estimate. Therefore, by controlling the size of the map we can estimate the number of collisions, and with it the error we will see in the final result.

Hyperloglog is more space-efficient than the counter above. The name is self-descriptive: to estimate the cardinality of a data set whose cardinality is at most Nmax, the hyper loglog counter needs only loglog(Nmax) + O(1) bits. Like the linear counter, the hyper loglog counter lets the designer specify the desired accuracy. In the case of hyper loglog, this is done by defining the required relative standard deviation and the maximum cardinality to be counted.

Most counters take an input data stream M and apply a hash function to it, h(M). This yields an observable S = h(M) of {0, 1}^∞ strings. Hyper loglog extends this idea by splitting the hashed input stream into m substreams and maintaining one observable for each substream, m observables in total. Averaging these additional observables yields a counter whose accuracy improves as m grows, while only a constant number of operations is performed on each element of the input set.

The result is that this counter can count a billion distinct elements with an accuracy of 2% using only 1.5 KB of space. Compared with the 120 MB required by a hash set implementation, the algorithm is very efficient. This is the legendary "how to count billions of objects with only KB of memory".
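The substream idea can be illustrated with a deliberately simplified sketch. It is not the Redis implementation, and it rests on several assumptions: items are 64-bit integers, the hash is the splitmix64 mixing function, the number of registers is fixed at m = 256, and the small-range and large-range corrections of the full algorithm are omitted, so the estimate is only meaningful when the cardinality is well above m. The `Hll` type and `hll_*` functions are hypothetical names. Each register keeps the largest "rank" seen in its substream, where the rank is the position of the first 1 bit in the remaining hash bits, and the estimate is a normalized harmonic mean over all registers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HLL_P 8                 /* index bits (a choice made for this sketch) */
#define HLL_M (1u << HLL_P)     /* m = 256 registers, one per substream */

typedef struct Hll {
    uint8_t reg[HLL_M];         /* largest rank observed per substream */
} Hll;

/* splitmix64 mixing function, used as a stand-in hash (an assumption). */
static uint64_t mix64(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

void hll_init(Hll *h) {
    memset(h->reg, 0, sizeof(h->reg));
}

void hll_add(Hll *h, uint64_t item) {
    uint64_t x = mix64(item);
    uint32_t idx = (uint32_t)(x >> (64 - HLL_P)); /* top bits pick the register */
    uint64_t rest = x << HLL_P;                   /* remaining bits, left-aligned */
    uint8_t rank = 1;                             /* position of the first 1 bit */
    while (rank <= 64 - HLL_P && !(rest & 0x8000000000000000ULL)) {
        rank++;
        rest <<= 1;
    }
    if (rank > h->reg[idx]) h->reg[idx] = rank;
}

/* Raw estimate: alpha_m * m^2 / sum(2^-reg[j]), without range corrections. */
double hll_estimate(const Hll *h) {
    double sum = 0.0;
    for (uint32_t j = 0; j < HLL_M; j++)
        sum += 1.0 / (double)(1ULL << h->reg[j]);
    double alpha = 0.7213 / (1.0 + 1.079 / HLL_M);
    return alpha * HLL_M * HLL_M / sum;
}
```

The counter occupies 256 bytes no matter how many items are added, and adding an item that was already seen never changes a register. The relative standard error of this family of estimators is roughly 1.04 / sqrt(m), about 6.5% for m = 256, so a larger m buys accuracy at the cost of memory.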

This article is an English version of an article originally published in Chinese on aliyun.com.
