We all know Redis's most commonly used data structures: strings, hashes, lists, sets, and sorted sets. Redis has since added several more, including HyperLogLog and geo (geographic location), the latter added in version 3.2.
Here is a brief introduction to the HyperLogLog structure.
First, the use case: this structure can keep all kinds of counts using very little memory, such as the number of registered IPs, the number of IPs visiting each day, real-time page UV (PV can simply be done with a string counter), or the number of online users.
Notice that every use case here is "the number of XXX". The defining characteristic of this data structure is that it can estimate the number you want fairly accurately, but it does not know the details behind that number. For example, when counting the IPs that visit each day, you can get the total number of IPs, but you cannot find out which IPs they were.
There is a trade-off, of course. Everything mentioned above can also be counted with a set, which gives you both the number and the full detailed list. But take the IPs of a large web site as an example: with 1 million IPs a day and a rough cost of 15 bytes per IP, 1 million IPs take 15 MB, and 10 million take 150 MB.
Now look at HyperLogLog. In Redis, each HyperLogLog key occupies about 12 KB and can in theory count close to 2^64 distinct values, regardless of what the values are. Only 12 KB: that shows how effective this data structure is, and it is also why it cannot tell you the details. It is a cardinality estimation algorithm: it uses a small, fixed amount of memory to estimate the number of unique elements in a set, without storing the elements themselves. The estimate is not exact; it is an approximation with a standard error of 0.81%.
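The 12 KB and 0.81% figures above fit together arithmetically. The sketch below assumes the commonly described layout of Redis's dense encoding (16384 registers of 6 bits each) and the usual HyperLogLog error formula 1.04 / sqrt(m); neither detail is stated in this article, so treat them as assumptions.

```python
import math

REGISTERS = 2 ** 14          # assumed: 16384 registers in the dense encoding
BITS_PER_REGISTER = 6        # assumed: 6 bits per register

# Total size of the register array in bytes.
memory_bytes = REGISTERS * BITS_PER_REGISTER // 8

# Usual HyperLogLog standard error for m registers: 1.04 / sqrt(m).
std_error = 1.04 / math.sqrt(REGISTERS)

print(memory_bytes)          # 12288 bytes, i.e. 12 KB
print(f"{std_error:.4%}")    # 0.8125%
```

Under these assumptions, the size depends only on the number of registers, not on how many values have been added.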
A HyperLogLog key occupies no more than 12 KB of memory, no matter how many values within the allowed range are added to it.
For example, suppose we record the IPs that visit each day, and 100 million IPs visit every day. With a set, one day takes 1.5 GB of memory, and keeping one month of records takes 45 GB. With HyperLogLog it is 12 KB a day, or 360 KB a month. If we do not need the specific IP information, we can keep these records in memory for a year, or never delete them at all. If the details are needed, we can store the full IP access records some other way. With the daily counts stored, we can use a merge to calculate the total number of distinct IPs per month, per year, and so on (duplicates are removed).
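The arithmetic in this comparison can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch using the article's own rough figures (15 bytes per IP, 12 KB per HyperLogLog key, decimal units); the function names are made up for illustration, and a real Redis set would carry additional per-element overhead.

```python
BYTES_PER_IP = 15            # rough per-IP cost assumed in the text
HLL_BYTES_PER_KEY = 12_000   # "12K" per key, rounded as in the text

def set_bytes(unique_ips, days=1):
    """Memory for storing every IP in a set, one set per day."""
    return unique_ips * BYTES_PER_IP * days

def hll_bytes(days=1):
    """Memory for one HyperLogLog key per day, independent of IP count."""
    return HLL_BYTES_PER_KEY * days

print(set_bytes(100_000_000) / 1e9)       # 1.5   (GB for one day)
print(set_bytes(100_000_000, 30) / 1e9)   # 45.0  (GB for one month)
print(hll_bytes(30) / 1e3)                # 360.0 (KB for one month)
```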
Here is an introduction to the HyperLogLog commands. They resemble the set commands, but there are only a few of them, and none of them can return the list of elements. Note that this data structure requires Redis 2.8.9 or later.
Pfadd
This command adds elements to the HyperLogLog and reports whether its internal structure was updated: it returns 1 if the HyperLogLog's estimated cardinality changed after execution, and 0 otherwise (for example, if the elements were already present).
A special feature of this command is that it can be called with only a key and no values, which simply creates an empty key without adding anything.
In that case, if the key already exists, the command does nothing and returns 0; if it does not exist, the key is created and the command returns 1.
The time complexity of this command is O(1) for each element added, so feel free to use it.
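To make the return value described above more concrete, here is a toy HyperLogLog in Python. It is a simplified sketch, not Redis's implementation: the class name ToyHLL, the use of SHA-1 as the hash, and the shortcut of using linear counting whenever any register is still empty are all assumptions made for illustration.

```python
import hashlib
import math

P = 14                 # index bits (assumed), giving 2**14 registers
M = 1 << P

class ToyHLL:
    """Illustrative HyperLogLog; not Redis's actual code."""

    def __init__(self):
        self.registers = [0] * M

    def add(self, *items):
        """Return 1 if any register changed, else 0 (mimics PFADD)."""
        changed = 0
        for item in items:
            digest = hashlib.sha1(item.encode()).digest()
            h = int.from_bytes(digest[:8], "big")   # 64-bit hash
            index = h & (M - 1)                     # low P bits pick a register
            rest = h >> P                           # remaining 50 bits
            rank = 1                                # position of the first 1-bit
            while rest & 1 == 0 and rank <= 64 - P:
                rank += 1
                rest >>= 1
            if rank > self.registers[index]:
                self.registers[index] = rank
                changed = 1
        return changed

    def count(self):
        """Estimate the number of distinct items added so far."""
        zeros = self.registers.count(0)
        if zeros:
            # Simplification: linear counting while empty registers remain.
            return round(M * math.log(M / zeros))
        alpha = 0.7213 / (1 + 1.079 / M)
        harmonic = sum(2.0 ** -r for r in self.registers)
        return round(alpha * M * M / harmonic)

hll = ToyHLL()
print(hll.add("1.1.1.1", "2.2.2.2", "3.3.3.3"))  # 1: registers changed
print(hll.add("2.2.2.2"))                        # 0: nothing changed
print(hll.count())                               # estimate, close to 3
```

Adding an element that is already present hashes to the same register with the same rank, so nothing changes and the toy returns 0, which is the behaviour the command description refers to.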
command example:
redis> pfadd ip:20160929 "1.1.1.1" "2.2.2.2" "3.3.3.3"
(integer) 1
redis> pfadd ip:20160929 "2.2.2.2" "4.4.4.4" "5.5.5.5" # only the new elements are counted
(integer) 1
redis> pfcount ip:20160929 # estimated number of distinct elements
(integer) 5
redis> pfadd ip:20160929 "2.2.2.2" # already present, so the estimate does not change
(integer) 0
As you can see, with a small number of elements the estimate is quite accurate.
Pfcount
We have already used this command in the example above; here is a proper introduction.
When called with a single key, it returns the estimated cardinality of that key. If the key does not exist, it returns 0.
When called with multiple keys, it returns the estimated cardinality of the union of those keys, as if the keys had been merged and the command then called on the result.
When called with a single key, the time complexity is O(1), with a very small average constant time. When called with N keys, the time complexity is O(N), and the constant time is considerably higher.
command example:
redis> pfadd ip:20160929 "1.1.1.1" "2.2.2.2" "3.3.3.3"
(integer) 1
redis> pfcount ip:20160929
(integer) 3
redis> pfadd ip:20160928 "1.1.1.1" "4.4.4.4" "5.5.5.5"
(integer) 1
redis> pfcount ip:20160928 ip:20160929
(integer) 5
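The multi-key result above is 5 rather than 6 because "1.1.1.1" appears on both days and a union counts it once. The following sketch models the same data with exact Python sets; it only illustrates the union semantics and says nothing about how Redis computes the estimate internally.

```python
# Same elements as in the example above, modelled as exact sets.
day_0929 = {"1.1.1.1", "2.2.2.2", "3.3.3.3"}
day_0928 = {"1.1.1.1", "4.4.4.4", "5.5.5.5"}

sum_of_counts = len(day_0929) + len(day_0928)   # counts 1.1.1.1 twice
union_count = len(day_0929 | day_0928)          # counts each IP once

print(sum_of_counts)   # 6
print(union_count)     # 5
```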
Pfmerge
Merges multiple HyperLogLogs into a single HyperLogLog. This is easy to understand: the estimated cardinality of the merged HyperLogLog approximates the cardinality of the union of all the source HyperLogLogs.
The first argument of this command is the destination key, and the remaining arguments are the HyperLogLogs to be merged. If the destination key does not exist when the command runs, it is created and the merge is then performed.
The time complexity of this command is O(N), where N is the number of HyperLogLogs to be merged. However, the constant time of this command is relatively high.
command example:
redis> PFADD ip:20160929 "1.1.1.1" "2.2.2.2" "3.3.3.3"
(integer) 1
redis> PFADD ip:20160928 "1.1.1.1" "4.4.4.4" "5.5.5.5"
(integer) 1
redis> PFMERGE ip:201609 ip:20160928 ip:20160929
OK
redis> PFCOUNT ip:201609
(integer) 5
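A HyperLogLog merge is generally described as taking, for each register, the maximum value across the inputs. The sketch below illustrates that idea on toy register arrays; the helper names, the SHA-1 hash, and the 2**14 register count are assumptions for illustration and not Redis's actual code.

```python
import hashlib

P = 14                 # index bits (assumed)
M = 1 << P             # number of registers

def build_registers(items):
    """Build a toy register array from an iterable of strings."""
    regs = [0] * M
    for item in items:
        digest = hashlib.sha1(item.encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        index, rest = h & (M - 1), h >> P
        rank = 1
        while rest & 1 == 0 and rank <= 64 - P:
            rank += 1
            rest >>= 1
        regs[index] = max(regs[index], rank)
    return regs

def merge(*register_arrays):
    """Register-wise maximum, the usual HyperLogLog merge rule."""
    return [max(column) for column in zip(*register_arrays)]

day_0929 = ["1.1.1.1", "2.2.2.2", "3.3.3.3"]
day_0928 = ["1.1.1.1", "4.4.4.4", "5.5.5.5"]

merged = merge(build_registers(day_0928), build_registers(day_0929))
combined = build_registers(day_0928 + day_0929)

# Merging two arrays gives the same registers as adding every element
# to a single array, which is why the merged count ignores duplicates.
print(merged == combined)   # True
```

Because the merged registers are identical to those built from all elements together, any estimate computed from them matches what a single HyperLogLog fed with both days would report.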