1. Basic Concepts
Cardinality refers to the number of distinct elements in a collection. For example, the collection {1, 2, 3, 4, 5, 2, 3, 9, 7} has 9 elements, but 2 and 3 each appear twice,
so the distinct elements are 1, 2, 3, 4, 5, 9, 7 and the cardinality of the set is 7.
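For comparison, the exact cardinality of a small collection can be computed simply by putting its elements into a set and taking the set's size. A minimal Java sketch (the class name is just for illustration):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CardinalityExample {
    public static void main(String[] args) {
        // the example collection above: 9 elements, of which 7 are distinct
        Set<Integer> distinct = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 2, 3, 9, 7));
        System.out.println(distinct.size()); // 7
    }
}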
Redis added the HyperLogLog structure in version 2.8.9. HyperLogLog is an algorithm for cardinality estimation, and its advantage is that even when the number or volume of input elements is very large, the space required to compute the cardinality stays fixed and small: inside Redis, each HyperLogLog key takes only 12 KB of memory, yet it can count a cardinality of close to 2^64 distinct elements. This is in stark contrast to a collection, whose memory consumption grows with the number of elements.
However, because HyperLogLog only computes the cardinality from the input elements and does not store the input elements themselves, it cannot, like a collection, return the individual elements that were added.
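For reference, the Redis commands involved are PFADD and PFCOUNT. A small sketch using the Jedis client (the key name and connection details here are made up for illustration, and it assumes a local Redis server and Jedis on the classpath):

import redis.clients.jedis.Jedis;

public class RedisHllExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // PFADD records only enough information to estimate the cardinality
            jedis.pfadd("page:uv", "user1", "user2", "user3", "user2");
            // PFCOUNT returns the estimated number of distinct elements
            System.out.println(jedis.pfcount("page:uv")); // ~3
        }
    }
}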
HyperLogLog works by applying a hash function h to the elements of an input data stream M, producing for each element an observable: a bit string S = h(M) over {0,1}^∞.
By splitting the hashed input into m sub-streams and maintaining the observable of each sub-stream separately (each sub-stream is, in effect, its own HyperLogLog), and then averaging the per-sub-stream observations, one obtains an estimator whose precision increases with m while needing only a few operations per input element.
2. Algorithm Framework
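In outline (restating what the Java implementation in section 4 does): each element is hashed to 64 bits; the top p bits select one of m = 2^p registers; the remaining bits are scanned for the position of the first 1 bit (the "rank"); each register keeps the maximum rank it has seen; and the cardinality is estimated from the harmonic mean of the registers as E = alpha(m) * m^2 / sum_j 2^(-register[j]), with a linear-counting correction when the estimate is small. A minimal sketch of the per-element update, mirroring hllAdd() in the implementation below (the method and parameter names here are illustrative only):

public class HllUpdateSketch {
    // Illustrative only: split a 64-bit hash into a register index and a rank,
    // and keep the maximum rank per register (p = 14 gives m = 16384 registers).
    static void add(long hash, int[] registers, int p) {
        int index = (int) (hash >>> (64 - p));   // top p bits choose the register
        long rest = hash | 1L;                   // force a 1 bit so the scan terminates
        int rank = 1;
        long bit = 1L << (63 - p);
        while ((rest & bit) == 0L) {             // count zeros after the index bits, plus the final 1
            rank++;
            bit >>= 1;
        }
        registers[index] = Math.max(registers[index], rank);
    }
}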
3. Algorithm Derivation and Proof
The HyperLogLog algorithm rests on some fairly involved probability and statistics; interested readers can consult the original paper.
4. Java Implementation
The Java implementation below follows the Redis source code (hyperloglog.c).
MurmurHash is used to hash the input elements and produce uniformly distributed hash values.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class MurmurHash {
    /**
     * MurmurHash (64-bit) algorithm implementation.
     */
    public static long hash64(byte[] key) {
        ByteBuffer buf = ByteBuffer.wrap(key);
        int seed = 0x1234ABCD;

        ByteOrder byteOrder = buf.order();
        buf.order(ByteOrder.LITTLE_ENDIAN);

        long m = 0xc6a4a7935bd1e995L;
        int r = 47;

        long h = seed ^ (buf.remaining() * m);

        long k;
        while (buf.remaining() >= 8) {
            k = buf.getLong();

            k *= m;
            k ^= k >>> r;
            k *= m;

            h ^= k;
            h *= m;
        }

        if (buf.remaining() > 0) {
            ByteBuffer finish = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
            // for a big-endian version, do this first:
            // finish.position(8 - buf.remaining());
            finish.put(buf).rewind();
            h ^= finish.getLong();
            h *= m;
        }

        h ^= h >>> r;
        h *= m;
        h ^= h >>> r;

        buf.order(byteOrder);
        return h;
    }
}
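A quick usage sketch (the input bytes are arbitrary):

public class MurmurHashDemo {
    public static void main(String[] args) {
        // hash64 maps any byte array to a 64-bit value that should spread uniformly over the long range
        long h = MurmurHash.hash64("hello".getBytes());
        System.out.println(Long.toHexString(h));
    }
}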
HyperLogLog implementation
public class HyperLogLog {
    // number of bits of the 64-bit hash used as the group (register) index;
    // more groups mean a smaller error but more space
    private static final int HLL_P = 14;
    private static final int HLL_REGISTERS = 1 << HLL_P;              // total number of groups
    private static final int HLL_BITS = 6;                            // bits needed to store each group's leading-zero count
    private static final int HLL_REGISTER_MASK = (1 << HLL_BITS) - 1; // 6-bit mask for one group value

    /*
     * Bitmap storage format, little-endian: the least significant bits are stored
     * first, then the most significant bits.
     * +--------+--------+--------+------////--+
     * |11000000|22221111|33333322|55444444 ...|
     * +--------+--------+--------+------////--+
     */
    private byte[] registers;

    public HyperLogLog() {
        // 12288 + 1 (12 KB) bytes; the last extra byte acts as a terminator and is not actually used
        registers = new byte[(HLL_REGISTERS * HLL_BITS + 7) / 8 + 1];
    }

    // alpha coefficient, from the reference paper
    private double alpha(int m) {
        switch (m) {
            case 16: return 0.673;
            case 32: return 0.697;
            case 64: return 0.709;
            default: return 0.7213 / (1 + 1.079 / m);
        }
    }

    // set the value of group `index` to val
    private void setRegister(int index, int val) {
        int _byte = index * HLL_BITS / 8;
        int _fb = index * HLL_BITS & 7;
        int _fb8 = 8 - _fb;
        registers[_byte] &= ~(HLL_REGISTER_MASK << _fb);
        registers[_byte] |= val << _fb;
        registers[_byte + 1] &= ~(HLL_REGISTER_MASK >> _fb8);
        registers[_byte + 1] |= val >> _fb8;
    }

    // read the value of group `index`
    private int getRegister(int index) {
        int _byte = index * HLL_BITS / 8;
        int _fb = index * HLL_BITS & 7;
        int _fb8 = 8 - _fb;
        int b0 = registers[_byte] & 0xff;
        int b1 = registers[_byte + 1] & 0xff;
        return ((b0 >> _fb) | (b1 << _fb8)) & HLL_REGISTER_MASK;
    }

    public int hllAdd(int number) {
        long hash = MurmurHash.hash64(Integer.toString(number).getBytes());
        long index = hash >>> (64 - HLL_P);
        int oldCount = getRegister((int) index);
        // count the consecutive 0 bits of the hash starting after the HLL_P index bits, including the final 1
        hash |= 1L;
        long bit = 1L << (63 - HLL_P);
        int count = 1;
        while ((hash & bit) == 0L) {
            count++;
            bit >>= 1L;
        }
        if (count > oldCount) {
            setRegister((int) index, count);
            return 1;
        } else {
            return 0;
        }
    }

    // estimate the cardinality
    public long hllCount() {
        // compute the harmonic mean of the group statistics: sum(2^-reg)
        double E = 0;
        int ez = 0;
        double m = HLL_REGISTERS;
        for (int i = 0; i < HLL_REGISTERS; i++) {
            int reg = getRegister(i);
            if (reg == 0) {
                ez++;
            } else {
                E += 1.0d / (1L << reg);
            }
        }
        E += ez;
        E = 1 / E * alpha((int) m) * m * m;
        if (E < m * 2.5 && ez != 0) {
            // small-range correction: linear counting
            E = m * Math.log(m / ez);
        } else if (m == 16384 && E < 72000) {
            // bias-correction polynomial taken from the Redis source
            double bias = 5.9119e-18 * E * E * E * E - 1.4253e-12 * E * E * E
                    + 1.2940e-7 * E * E - 5.2921e-3 * E + 83.3216;
            E -= E * (bias / 100);
        }
        return (long) E;
    }
}
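Design note: 16384 registers at 6 bits each come to 12288 bytes, which matches the 12 KB per-key figure quoted for Redis in section 1; packing the registers as 6-bit fields rather than one byte per register is what keeps the structure this small.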
Test
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class Test {
    // test with a collection of n randomly generated elements
    public static void testHyperLogLog(int n) {
        System.out.println("n = " + n);
        HyperLogLog hyperLogLog = new HyperLogLog();
        Set<Integer> s = new HashSet<>();
        Random random = new Random();
        for (int i = 0; i < n; i++) {
            int number = random.nextInt();
            hyperLogLog.hllAdd(number);
            s.add(number);
        }
        System.out.println("hyperloglog count = " + hyperLogLog.hllCount());
        System.out.println("hashset count = " + s.size());
        System.out.println("error rate = " + Math.abs((double) hyperLogLog.hllCount() / s.size() - 1));
    }

    public static void main(String[] args) {
        int n = 1;
        for (int i = 0; i < 9; i++) {
            n *= 10;
            testHyperLogLog(n);
        }
    }
}
5. Test Results
In the output below, n is the total number of elements generated, hyperloglog count is the cardinality estimated by the HyperLogLog algorithm, hashset count is the exact count obtained with a HashSet, and error rate is the relative error between the two.
As the results show, the HyperLogLog error rate mostly stays within 1%; when the total number of elements reaches 100 million, the HashSet run fails with an OutOfMemoryError.
n = 10
hyperloglog count = 10
hashset count = 10
error rate = 0.0
n = 100
hyperloglog count = 100
hashset count = 100
error rate = 0.0
n = 1000
hyperloglog count = 1002
hashset count = 1000
error rate = 0.0020000000000000018
n = 10000
hyperloglog count = 9974
hashset count = 10000
error rate = 0.0026000000000000467
n = 100000
hyperloglog count = 100721
hashset count = 99999
error rate = 0.007220072200722072
n = 1000000
hyperloglog count = 990325
hashset count = 999883
error rate = 0.00955911841685475
n = 10000000
hyperloglog count = 9966476
hashset count = 9988334
error rate = 0.002188352932531057
n = 100000000
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.util.HashMap.resize(HashMap.java:703)
	at java.util.HashMap.putVal(HashMap.java:662)
	at java.util.HashMap.put(HashMap.java:611)
	at java.util.HashSet.add(HashSet.java:219)
	at com.sankuai.alg.Test.testHyperLogLog(Test.java:24)
	at com.sankuai.alg.Test.main(Test.java:36)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Process finished with exit code 1