Redis Cardinality Statistics: HyperLogLog, Small Memory, Big Uses

Last Update:2018-08-23 Source: Internet

Author: User


We all know the most commonly used Redis data structures: strings, hashes, lists, sets, and sorted sets. Redis has since added several more. One of them is HyperLogLog; another is GEO (geographic location), which was added in version 3.2.

Here is a brief introduction to the HyperLogLog structure.

First, what it is for: this structure can keep all kinds of distinct counts using very little memory, such as the number of registered IPs, the number of IPs visiting per day, real-time page UV (PV is simply kept in a string counter), and the number of online users.

Notice that every use above is "the number of something". That is the defining trait of this data structure: it can estimate the count you want fairly accurately, but it does not know the individual items behind the count. For example, when counting daily visiting IPs, you can get the total number of IPs, but not which IPs they were.

That is the trade-off. Of course, you could count everything mentioned above with a set, which gives you both the count and the full list of members. But take a large website with, say, 1 million IPs per day. If we roughly estimate 15 bytes per IP, 1 million IPs take 15 MB, and 10 million take 150 MB.

Now look at HyperLogLog. In Redis each key occupies about 12 KB and can, in theory, count approximately 2^64 distinct values, regardless of what the values are. That 12 KB figure shows how effective this data structure is, and it is also why it cannot tell you the details: it does not store the elements themselves. It is a cardinality estimation algorithm that uses a small, fixed amount of memory to track the unique elements of a set. The cardinality it returns is not exact; it is an approximation with a standard error of 0.81%.
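The 12 KB and 0.81% figures can be checked with a little arithmetic. The sketch below assumes the dense layout described in the Redis documentation (16384 registers of 6 bits each) and the usual HyperLogLog error formula 1.04 / sqrt(m); neither detail is stated in this article, so treat them as background assumptions.

```python
import math

# Assumptions (from the Redis documentation, not from this article):
# the dense representation uses 2**14 registers of 6 bits each.
REGISTERS = 2 ** 14
BITS_PER_REGISTER = 6

# Total size of the register array in bytes.
dense_bytes = REGISTERS * BITS_PER_REGISTER // 8

# Standard error of a HyperLogLog with m registers is about 1.04 / sqrt(m).
std_error = 1.04 / math.sqrt(REGISTERS)

print(f"register array: {dense_bytes} bytes ({dense_bytes / 1024:.0f} KB)")
print(f"standard error: {std_error:.2%}")
```

Under these assumptions the register array comes to 12288 bytes, that is 12 KB, and the standard error to about 0.81%, matching the numbers quoted above.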

A HyperLogLog occupies only about 12 KB of memory, no matter how many values it has counted, as long as the count stays within the supported range.

For example, suppose we record each day's visiting IPs and 100 million distinct IPs visit per day. With a set, one day takes about 1.5 GB of memory, and keeping one month of records takes 45 GB. With HyperLogLog it is 12 KB a day and 360 KB a month. If we do not need the individual IPs, we can keep these records in memory for a year, or never delete them at all. If the individual IPs are needed, we can store the access records some other way. Once the daily data is stored, we can merge the days (PFMERGE) to get the deduplicated number of IPs per month, per year, and so on.
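The comparison above is simple arithmetic and can be reproduced as follows. This is a back-of-the-envelope sketch using the article's own rough figures (15 bytes per IP, 12 KB per HyperLogLog key, a 30-day month); the constant and function names are made up for illustration, and real set memory use in Redis would be higher because of per-element overhead.

```python
# Rough figures taken from the text above.
BYTES_PER_IP = 15                 # assumed average size of one stored IP
HLL_BYTES_PER_KEY = 12 * 1024     # one HyperLogLog key, about 12 KB
DAYS_PER_MONTH = 30


def set_bytes(unique_ips: int) -> int:
    """Approximate memory for storing every IP in a set, ignoring overhead."""
    return unique_ips * BYTES_PER_IP


daily_set = set_bytes(100_000_000)
monthly_set = daily_set * DAYS_PER_MONTH
monthly_hll = HLL_BYTES_PER_KEY * DAYS_PER_MONTH

print(f"set, one day:   {daily_set / 1e9:.1f} GB")
print(f"set, one month: {monthly_set / 1e9:.0f} GB")
print(f"HLL, one month: {monthly_hll // 1024} KB")
```

The set grows linearly with the number of visitors, while the HyperLogLog cost depends only on the number of keys kept.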

Here is an introduction to the HyperLogLog commands. They resemble the set commands, but there are only a few of them and none can return the list of members. Note that this data structure requires Redis 2.8.9 or later.

PFADD

When this command runs, the internal structure of the HyperLogLog is updated. If the cardinality estimate of the HyperLogLog changed as a result, it returns 1; otherwise (the elements were already counted) it returns 0.
A special case is calling the command with only a key and no values, which simply creates an empty key without adding anything.
In that case, if the key already exists, the command does nothing and returns 0; if it does not exist, the command creates it and returns 1.

The time complexity of this command is O(1) per element added, so feel free to use it.
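To make the return value and the constant-time cost concrete, here is a toy sketch of the add step of a HyperLogLog in plain Python. It is not Redis's implementation: the class name, the use of SHA-1 as the hash, and the choice of which hash bits select the register are all assumptions made for illustration. Each element is hashed once and touches a single register, which is why adding is O(1) per element, and the method returns 1 only if some register actually changed.

```python
import hashlib

P = 14            # number of hash bits used to pick a register
M = 1 << P        # 16384 registers


def hash64(value: str) -> int:
    """64-bit hash of a string. SHA-1 is an assumption made for this sketch."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyHyperLogLog:
    """Illustrative only: models the registers, not Redis's encoding."""

    def __init__(self) -> None:
        self.registers = [0] * M

    def add(self, *values: str) -> int:
        """Add values; return 1 if any register changed, else 0."""
        changed = 0
        for value in values:
            h = hash64(value)
            index = h & (M - 1)      # low P bits choose the register
            rest = h >> P            # remaining 50 bits
            # Rank = 1-based position of the lowest set bit in the rest.
            rank = (rest & -rest).bit_length() if rest else 64 - P + 1
            if rank > self.registers[index]:
                self.registers[index] = rank
                changed = 1
        return changed


hll = ToyHyperLogLog()
first = hll.add("1.1.1.1", "2.2.2.2", "3.3.3.3")   # new elements
repeat = hll.add("2.2.2.2")                        # already seen
print(first, repeat)  # prints "1 0"
```

Adding an element that was already seen hashes to the same register with the same rank, so nothing changes and the result is 0.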

Command example:

redis> PFADD ip:20160929 "1.1.1.1" "2.2.2.2" "3.3.3.3"
(integer) 1
redis> PFADD ip:20160929 "2.2.2.2" "4.4.4.4" "5.5.5.5"  # only the new elements are counted
(integer) 1
redis> PFCOUNT ip:20160929  # five distinct elements so far
(integer) 5
redis> PFADD ip:20160929 "2.2.2.2"  # already counted, so nothing changes
(integer) 0

As you can see, with a small number of elements the estimate is quite accurate.

PFCOUNT

We already used this command in the example above; here is a proper introduction.

When the command acts on a single key, it returns the cardinality estimate for that key. If the key does not exist, it returns 0.
When used with multiple keys, it returns the estimated cardinality of their union, as if the keys had been merged first and the command then run on the result.

With a single key, the time complexity is O(1) with a very low average constant time. With N keys, the time complexity is O(N), and the constant factor is much higher.

Command example:

redis> PFADD ip:20160929 "1.1.1.1" "2.2.2.2" "3.3.3.3"
(integer) 1
redis> PFCOUNT ip:20160929
(integer) 3
redis> PFADD ip:20160928 "1.1.1.1" "4.4.4.4" "5.5.5.5"
(integer) 1
redis> PFCOUNT ip:20160928 ip:20160929
(integer) 5
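
The result is 5 rather than 6 because the multi-key form counts the union, so "1.1.1.1" is counted only once. The sketch below reproduces this with exact Python sets standing in for the two keys. The variable names are made up for illustration, and a real HyperLogLog returns an estimate rather than an exact count.

```python
# Exact sets standing in for the two HyperLogLog keys from the example.
day_0929 = {"1.1.1.1", "2.2.2.2", "3.3.3.3"}
day_0928 = {"1.1.1.1", "4.4.4.4", "5.5.5.5"}

# Counting several keys at once means counting their union.
union_count = len(day_0929 | day_0928)

# Adding the per-day counts would count "1.1.1.1" twice.
naive_sum = len(day_0929) + len(day_0928)

print(union_count, naive_sum)  # prints "5 6"
```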

PFMERGE

Merges multiple HyperLogLogs into a single HyperLogLog. The cardinality estimate of the merged HyperLogLog approximates the cardinality of the union of the sets observed by all the source HyperLogLogs.

The first parameter of this command is the target key, and the remaining parameters are the HyperLogLogs to merge. If the target key does not exist when the command runs, it is created before the merge is performed.

The time complexity of this command is O(N), where N is the number of HyperLogLogs to merge. The constant factor of this command is relatively high, however.
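In the standard HyperLogLog algorithm, a merge keeps the larger value of each register across all sources. The sketch below shows that step on tiny four-register lists; the function name and the sample register values are hypothetical, and a real HyperLogLog in Redis has far more registers and its own internal encoding. Every source is visited once, which is where the O(N) cost comes from, and every visit walks the whole register array, which is why the constant factor is high.

```python
from typing import List


def merge_registers(*sources: List[int]) -> List[int]:
    """Merge register arrays by taking the per-register maximum."""
    if not sources:
        raise ValueError("at least one source is required")
    size = len(sources[0])
    if any(len(source) != size for source in sources):
        raise ValueError("all sources must have the same number of registers")
    # zip(*sources) yields one tuple per register position.
    return [max(column) for column in zip(*sources)]


# Tiny made-up register arrays, four registers each, for illustration.
day_a = [1, 0, 3, 0]
day_b = [0, 2, 1, 0]

merged = merge_registers(day_a, day_b)
print(merged)  # prints "[1, 2, 3, 0]"
```

Because the operation is a per-register maximum, merging the same source twice changes nothing, so daily keys can be merged into a monthly key repeatedly without double counting.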

Command example:

redis> PFADD ip:20160929 "1.1.1.1" "2.2.2.2" "3.3.3.3"
(integer) 1
redis> PFADD ip:20160928 "1.1.1.1" "4.4.4.4" "5.5.5.5"
(integer) 1
redis> PFMERGE ip:201609 ip:20160928 ip:20160929
OK
redis> PFCOUNT ip:201609
(integer) 5

This article is an English version of an article originally published in Chinese on aliyun.com.
