HyperLogLog: the cardinality estimation algorithm in Redis

1. Basic Concepts

Cardinality refers to the number of distinct elements in a collection. For example, the collection {1,2,3,4,5,2,3,9,7} has 9 elements, but 2 and 3 each appear twice,

so the distinct elements are 1,2,3,4,5,9,7, and the cardinality of this collection is 7.
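For comparison, exact cardinality can be computed with a HashSet, at the cost of storing every distinct element (a minimal sketch; class and variable names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

public class CardinalityExample {
    public static void main(String[] args) {
        int[] elements = {1, 2, 3, 4, 5, 2, 3, 9, 7};
        // A HashSet keeps only distinct values, so its size is the cardinality.
        Set<Integer> distinct = new HashSet<>();
        for (int e : elements) {
            distinct.add(e);
        }
        System.out.println(distinct.size()); // prints 7
    }
}
```

This exact approach is what the Test section at the end uses as the ground truth, and its memory cost is exactly what HyperLogLog avoids.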

Redis added the HyperLogLog structure in version 2.8.9. HyperLogLog is an algorithm for cardinality estimation. Its advantage is that even when the number or volume of input elements is very large, the space required to estimate the cardinality stays fixed and small: inside Redis, each HyperLogLog key takes only

12 KB of memory, yet it can estimate cardinalities approaching 2^64 distinct elements. This is in stark contrast to exact counting, where memory consumption grows with the number of elements.

However, because HyperLogLog only computes the cardinality from the input elements and does not store the input elements themselves, it cannot return

the individual elements the way a collection type can.
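The fixed-space/small-error trade-off can be made concrete. The standard error of HyperLogLog with m registers is approximately 1.04/√m (this figure comes from the original paper and is not stated above); with the 16384 registers Redis uses, that works out to roughly 0.81%:

```java
public class HllErrorEstimate {
    public static void main(String[] args) {
        int m = 16384;                          // registers in a Redis HLL key
        double stdError = 1.04 / Math.sqrt(m);  // ~0.0081, i.e. ~0.81%
        System.out.printf("standard error = %.4f%n", stdError);
    }
}
```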

HyperLogLog works by applying a hash function h to each element x of the input data stream M, producing an observable: a bit string s = h(x) in {0,1}^∞.

By splitting the hash value into m sub-streams (one per bucket) and maintaining an observable for each sub-stream, each bucket effectively becomes a small HyperLogLog of its own. Averaging the per-bucket observations (stochastic averaging) yields an estimator whose precision improves as m grows, while each element of the input set

requires only a few operations to process.

2. Algorithm Framework
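The per-element step described above can be sketched as follows (an illustrative fragment, not the Redis code; `bucket` and `rank` are hypothetical helper names, using p = 14 index bits as in the implementation below):

```java
public class HllStep {
    // The high p bits of the 64-bit hash select the bucket.
    static int bucket(long hash, int p) {
        return (int) (hash >>> (64 - p));
    }

    // rank = 1-based position of the first 1-bit in the remaining 64-p bits.
    static int rank(long hash, int p) {
        long rest = hash << p;             // drop the index bits
        if (rest == 0) {
            return 64 - p + 1;             // all remaining bits zero: max rank
        }
        return Long.numberOfLeadingZeros(rest) + 1;
    }

    public static void main(String[] args) {
        long hash = 0xABCD123456789ABCL;   // pretend this is h(x)
        int p = 14;                        // 2^14 = 16384 buckets
        System.out.println(bucket(hash, p)); // prints 10995
        System.out.println(rank(hash, p));   // prints 2
    }
}
```

Each bucket keeps only the maximum rank ever observed; the final estimate is derived from those per-bucket maxima.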

3. Algorithm derivation and proof

The HyperLogLog algorithm rests on some fairly involved probability and statistics; interested readers can consult the original paper.
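For reference, the final estimator combines the m bucket registers M_j through a harmonic-style mean; this is the same formula the hllCount method implements in the code below:

```latex
E = \alpha_m \cdot m^2 \cdot \Bigl( \sum_{j=1}^{m} 2^{-M_j} \Bigr)^{-1},
\qquad \alpha_m = \frac{0.7213}{1 + 1.079/m} \ \ (m \ge 128)
```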

4. Java Implementation

The implementation below follows the Redis source code (hyperloglog.c), ported to Java.

MurmurHash is used to hash the input elements, producing a uniformly distributed 64-bit hash value.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class MurmurHash {
    /** MurmurHash64A implementation */
    public static long hash64(byte[] key) {
        ByteBuffer buf = ByteBuffer.wrap(key);
        int seed = 0x1234ABCD;

        ByteOrder byteOrder = buf.order();
        buf.order(ByteOrder.LITTLE_ENDIAN);

        long m = 0xc6a4a7935bd1e995L;
        int r = 47;

        long h = seed ^ (buf.remaining() * m);

        long k;
        while (buf.remaining() >= 8) {
            k = buf.getLong();

            k *= m;
            k ^= k >>> r;
            k *= m;

            h ^= k;
            h *= m;
        }

        if (buf.remaining() > 0) {
            ByteBuffer finish = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
            // for big-endian version, do this first:
            // finish.position(8 - buf.remaining());
            finish.put(buf).rewind();
            h ^= finish.getLong();
            h *= m;
        }

        h ^= h >>> r;
        h *= m;
        h ^= h >>> r;

        buf.order(byteOrder);
        return h;
    }
}

HyperLogLog Implementation

public class HyperLogLog {
    // number of bits of the 64-bit hash used as the bucket index;
    // more buckets means smaller error but more space
    private static final int HLL_P = 14;
    // total number of buckets
    private static final int HLL_REGISTERS = 1 << HLL_P;
    // bits needed to store each bucket's maximum leading-zero count
    private static final int HLL_BITS = 6;
    // mask for a 6-bit register value
    private static final int HLL_REGISTER_MASK = (1 << HLL_BITS) - 1;

    /**
     * Bitmap storage format, little-endian style: the least significant
     * bits are stored first, the most significant bits last.
     * +--------+--------+--------+------////--+
     * |11000000|22221111|33333322|55444444 ....|
     * +--------+--------+--------+------////--+
     */
    private byte[] registers;

    public HyperLogLog() {
        // 12288 + 1 (12 KB) bytes; the last extra byte acts as a
        // terminator and is never actually used
        registers = new byte[(HLL_REGISTERS * HLL_BITS + 7) / 8 + 1];
    }

    // alpha coefficient, from the reference paper
    private double alpha(int m) {
        switch (m) {
            case 16: return 0.673;
            case 32: return 0.697;
            case 64: return 0.709;
            default: return 0.7213 / (1 + 1.079 / m);
        }
    }

    // store val as the value of bucket index
    private void setRegister(int index, int val) {
        int _byte = index * HLL_BITS / 8;
        int _fb = index * HLL_BITS & 7;
        int _fb8 = 8 - _fb;
        registers[_byte] &= ~(HLL_REGISTER_MASK << _fb);
        registers[_byte] |= val << _fb;
        registers[_byte + 1] &= ~(HLL_REGISTER_MASK >> _fb8);
        registers[_byte + 1] |= val >> _fb8;
    }

    // read the value of bucket index
    private int getRegister(int index) {
        int _byte = index * HLL_BITS / 8;
        int _fb = index * HLL_BITS & 7;
        int _fb8 = 8 - _fb;
        int b0 = registers[_byte] & 0xff;
        int b1 = registers[_byte + 1] & 0xff;
        return ((b0 >> _fb) | (b1 << _fb8)) & HLL_REGISTER_MASK;
    }

    public int hllAdd(int number) {
        long hash = MurmurHash.hash64(Integer.toString(number).getBytes());
        long index = hash >>> (64 - HLL_P);

        int oldCount = getRegister((int) index);
        // count the consecutive 0 bits of the hash value starting after
        // the HLL_P index bits, plus the terminating 1 bit; setting the
        // lowest bit guarantees the loop terminates
        hash |= 1L;
        long bit = 1L << (63 - HLL_P);
        int count = 1;
        while ((hash & bit) == 0L) {
            count++;
            bit >>= 1L;
        }
        if (count > oldCount) {
            setRegister((int) index, count);
            return 1;
        } else {
            return 0;
        }
    }

    // estimate the cardinality
    public long hllCount() {
        // compute the harmonic-mean denominator of the bucket counts, SUM(2^-reg)
        double E = 0;
        int ez = 0;
        double m = HLL_REGISTERS;
        for (int i = 0; i < HLL_REGISTERS; i++) {
            int reg = getRegister(i);
            if (reg == 0) {
                ez++;
            } else {
                E += 1.0d / (1L << reg);
            }
        }
        E += ez;

        E = 1 / E * alpha((int) m) * m * m;
        if (E < m * 2.5 && ez != 0) {
            // small-range correction
            E = m * Math.log(m / ez);
        } else if (m == 16384 && E < 72000) {
            // bias-correction polynomial, from the Redis source
            double bias = 5.9119e-18 * E * E * E * E
                    - 1.4253e-12 * E * E * E
                    + 1.2940e-7 * E * E
                    - 5.2921e-3 * E
                    + 83.3216;
            E -= E * (bias / 100);
        }
        return (long) E;
    }
}

Test

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class Test {
    // test with a set of n elements
    public static void testHyperLogLog(int n) {
        System.out.println("n = " + n);
        HyperLogLog hyperLogLog = new HyperLogLog();
        Set<Integer> s = new HashSet<>();
        Random random = new Random();
        for (int i = 0; i < n; i++) {
            int number = random.nextInt();
            hyperLogLog.hllAdd(number);
            s.add(number);
        }

        System.out.println("hyperloglog count = " + hyperLogLog.hllCount());
        System.out.println("hashset count = " + s.size());
        System.out.println("error rate = "
                + Math.abs((double) hyperLogLog.hllCount() / s.size() - 1));
    }

    public static void main(String[] args) {
        int n = 1;
        for (int i = 0; i < 9; i++) {
            n *= 10;
            testHyperLogLog(n);
        }
    }
}

5. Test Results

n is the total number of elements generated; hyperloglog count is the cardinality estimated by the HyperLogLog algorithm, hashset count is the exact result from the HashSet, and error rate is the relative error between the two.

As the results show, the HyperLogLog error rate stays mostly within 1%; when the total number of elements reaches 100 million, the HashSet run throws an OutOfMemoryError.

n = 10
hyperloglog count = 10
hashset count = 10
error rate = 0.0

n = 100
hyperloglog count = 100
hashset count = 100
error rate = 0.0

n = 1000
hyperloglog count = 1002
hashset count = 1000
error rate = 0.0020000000000000018

n = 10000
hyperloglog count = 9974
hashset count = 10000
error rate = 0.0026000000000000467

n = 100000
hyperloglog count = 100721
hashset count = 99999
error rate = 0.007220072200722072

n = 1000000
hyperloglog count = 990325
hashset count = 999883
error rate = 0.00955911841685475

n = 10000000
hyperloglog count = 9966476
hashset count = 9988334
error rate = 0.002188352932531057

n = 100000000
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.HashMap.resize(HashMap.java:703)
    at java.util.HashMap.putVal(HashMap.java:662)
    at java.util.HashMap.put(HashMap.java:611)
    at java.util.HashSet.add(HashSet.java:219)
    at com.sankuai.alg.Test.testHyperLogLog(Test.java:24)
    at com.sankuai.alg.Test.main(Test.java:36)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)

Process finished with exit code 1

