HyperLogLog: the cardinality estimation algorithm in Redis

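1. Basic concepts

Cardinality is the number of distinct elements in a collection. Take the collection {1,2,3,4,5,2,3,9,7}: it has 9 elements, but 2 and 3 each appear twice, so the distinct elements are 1, 2, 3, 4, 5, 9, 7 and the cardinality is 7.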
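As a minimal sketch of this definition, the exact cardinality of the example can be computed by putting every element into a HashSet. The class name CardinalityExample is chosen here purely for illustration.

```java
import java.util.HashSet;
import java.util.Set;

final class CardinalityExample {
    // Exact cardinality: the number of distinct values in the input.
    static int cardinality(int[] elements) {
        Set<Integer> distinct = new HashSet<>();
        for (int e : elements) {
            distinct.add(e);
        }
        return distinct.size();
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 2, 3, 9, 7};
        System.out.println("elements = " + data.length);           // 9
        System.out.println("cardinality = " + cardinality(data));  // 7
    }
}
```

This exact approach keeps every distinct element in memory, which is the cost that HyperLogLog avoids.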

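Redis added the HyperLogLog data structure in version 2.8.9. HyperLogLog is an algorithm for cardinality estimation. Its advantage is that the space needed to compute the cardinality is fixed and small, no matter how many input elements there are or how large they are. In Redis, each HyperLogLog key uses only 12 KB of memory and can estimate the cardinality of close to 2^64 distinct elements. This is in sharp contrast to a set, which uses more memory the more elements it holds.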
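The 12 KB figure follows from the layout used by the implementation later in this article: 2^14 registers of 6 bits each. The short sketch below only reproduces that arithmetic; the class and method names are illustrative and not part of Redis.

```java
final class HllMemory {
    // Bytes needed to pack 2^indexBits registers of bitsPerRegister bits each.
    static int denseBytes(int indexBits, int bitsPerRegister) {
        int registers = 1 << indexBits;
        return (registers * bitsPerRegister + 7) / 8; // round up to whole bytes
    }

    public static void main(String[] args) {
        // 16384 registers * 6 bits = 98304 bits = 12288 bytes = 12 KB
        System.out.println(denseBytes(14, 6)); // 12288
    }
}
```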

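However, because HyperLogLog only computes the cardinality from the input elements and does not store the elements themselves, it cannot return the individual elements the way a set can.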

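HyperLogLog works by applying a hash function h to an input stream M. This yields an observable S = h(M), a collection of strings over {0,1}^∞. The hashed stream is split into m substreams and one observable is maintained for each of them, so each substream behaves like a small estimator of its own. Averaging these m observables gives an estimator whose accuracy improves as m grows, and it needs only a few operations per input element.

2. Algorithm framework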
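As a rough sketch of how one hash value is assigned to a substream, the snippet below follows the convention of the Java implementation in section 4: the top 14 bits of the 64-bit hash pick the register, and the observable is the position of the first 1 bit in the remaining 50 bits. The class HashSplit and its method names are illustrative assumptions, and Long.numberOfLeadingZeros stands in for the explicit loop used later.

```java
final class HashSplit {
    static final int P = 14; // number of index bits, as in section 4

    // The top P bits select the register (substream).
    static int index(long hash) {
        return (int) (hash >>> (64 - P));
    }

    // 1-based position of the first 1 bit after the index bits.
    // The lowest bit is forced to 1 so the result never exceeds 64 - P.
    static int rank(long hash) {
        long rest = (hash | 1L) << P; // drop the index bits
        return Long.numberOfLeadingZeros(rest) + 1;
    }

    public static void main(String[] args) {
        long hash = (5L << 50) | (1L << 47); // index 5, then bits 0,0,1,...
        System.out.println("index = " + index(hash)); // 5
        System.out.println("rank  = " + rank(hash));  // 3
    }
}
```

Each register keeps the largest rank it has seen; a large rank in a register suggests that many distinct hashes fell into that substream.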

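3. Derivation and proof

The HyperLogLog algorithm rests on some fairly involved probability and statistics. Interested readers can consult the original HyperLogLog paper for the derivation and proofs.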
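For orientation, the estimator that hllCount in section 4 evaluates can be written as follows. This is only a summary of the formula as it appears in that code, not a derivation. The symbols are introduced here for readability: m is the number of registers, M[j] is the value stored in register j, and V is the number of registers that are still zero (ez in the code).

```latex
\[
E = \alpha_m \, m^2 \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1},
\qquad
\alpha_m \approx \frac{0.7213}{1 + 1.079/m} \quad (m \ge 128)
\]
\[
E^{*} = m \ln\!\left( \frac{m}{V} \right)
\quad \text{if } E < 2.5\,m \text{ and } V \neq 0
\]
```

The first line is the normalized harmonic mean of the values 2^{M[j]}. The second is the linear-counting correction applied when the raw estimate is small.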

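4. Java implementation

The implementation below is a Java port modeled on the Redis source (hyperloglog.c).

MurmurHash is used to hash the input elements and produce uniformly distributed hash values.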

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class MurmurHash {
    /**
     * 64-bit MurmurHash implementation.
     */
    public static long hash64(byte[] key) {
        ByteBuffer buf = ByteBuffer.wrap(key);
        int seed = 0x1234ABCD;
        ByteOrder byteOrder = buf.order();
        buf.order(ByteOrder.LITTLE_ENDIAN);

        long m = 0xc6a4a7935bd1e995L;
        int r = 47;
        long h = seed ^ (buf.remaining() * m);
        long k;

        // Mix the input eight bytes at a time.
        while (buf.remaining() >= 8) {
            k = buf.getLong();
            k *= m;
            k ^= k >>> r;
            k *= m;
            h ^= k;
            h *= m;
        }

        // Mix the remaining 1 to 7 bytes, zero-padded to eight bytes.
        if (buf.remaining() > 0) {
            ByteBuffer finish = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
            // for big-endian version, do this first:
            // finish.position(8-buf.remaining());
            finish.put(buf).rewind();
            h ^= finish.getLong();
            h *= m;
        }

        // Final avalanche.
        h ^= h >>> r;
        h *= m;
        h ^= h >>> r;

        buf.order(byteOrder);
        return h;
    }
}

HyperLogLog implementation

import java.nio.charset.StandardCharsets;

public class HyperLogLog {
    // Number of bits of the 64-bit hash used as the register index.
    // More registers give a smaller error but take more space.
    private static final int HLL_P = 14;
    // Total number of registers.
    private static final int HLL_REGISTERS = 1 << HLL_P;
    // Bits per register, enough to hold the largest leading-zero count.
    private static final int HLL_BITS = 6;
    // 6-bit mask for a register value.
    private static final int HLL_REGISTER_MASK = (1 << HLL_BITS) - 1;

    /**
     * Registers are packed 6 bits each in little-endian order:
     * the least significant bits are stored first.
     * +--------+--------+--------+------//      //--+
     * |11000000|22221111|33333322|55444444 ....     |
     * +--------+--------+--------+------//      //--+
     */
    private byte[] registers;

    public HyperLogLog() {
        // 12288 bytes (12 KB) plus one spare byte. The spare byte keeps the
        // access to registers[_byte + 1] in bounds for the last register.
        registers = new byte[(HLL_REGISTERS * HLL_BITS + 7) / 8 + 1];
    }

    // Alpha constant from the reference paper.
    private double alpha(int m) {
        switch (m) {
            case 16:
                return 0.673;
            case 32:
                return 0.697;
            case 64:
                return 0.709;
            default:
                return 0.7213 / (1 + 1.079 / m);
        }
    }

    // Store val in register number index.
    private void setRegister(int index, int val) {
        int _byte = index * HLL_BITS / 8;
        int _fb = index * HLL_BITS & 7;
        int _fb8 = 8 - _fb;
        registers[_byte] &= ~(HLL_REGISTER_MASK << _fb);
        registers[_byte] |= val << _fb;
        registers[_byte + 1] &= ~(HLL_REGISTER_MASK >> _fb8);
        registers[_byte + 1] |= val >> _fb8;
    }

    // Read the value of register number index.
    private int getRegister(int index) {
        int _byte = index * HLL_BITS / 8;
        int _fb = index * HLL_BITS & 7;
        int _fb8 = 8 - _fb;
        int b0 = registers[_byte] & 0xff;
        int b1 = registers[_byte + 1] & 0xff;
        return ((b0 >> _fb) | (b1 << _fb8)) & HLL_REGISTER_MASK;
    }

    // Add an element. Returns 1 if a register changed, 0 otherwise.
    public int hllAdd(int number) {
        long hash = MurmurHash.hash64(
                Integer.toString(number).getBytes(StandardCharsets.UTF_8));
        long index = hash >>> (64 - HLL_P);
        int oldcount = getRegister((int) index);

        // Count the consecutive zeros that follow the HLL_P index bits,
        // including the terminating 1. Setting the lowest bit guarantees
        // that the loop terminates.
        hash |= 1L;
        long bit = 1L << (63 - HLL_P);
        int count = 1;
        while ((hash & bit) == 0L) {
            count++;
            bit >>= 1;
        }

        if (count > oldcount) {
            setRegister((int) index, count);
            return 1;
        } else {
            return 0;
        }
    }

    // Estimate the cardinality.
    public long hllCount() {
        // Harmonic mean of the register values: SUM(2^-reg).
        double E = 0;
        int ez = 0; // number of registers that are still zero
        double m = HLL_REGISTERS;
        for (int i = 0; i < HLL_REGISTERS; i++) {
            int reg = getRegister(i);
            if (reg == 0) {
                ez++;
            } else {
                E += 1.0d / (1L << reg);
            }
        }
        E += ez; // each empty register contributes 2^0 = 1
        E = 1 / E * alpha((int) m) * m * m;

        if (E < m * 2.5 && ez != 0) {
            // Small range: fall back to linear counting.
            E = m * Math.log(m / ez);
        } else if (m == 16384 && E < 72000) {
            // Bias correction polynomial taken from the Redis source.
            double bias = 5.9119e-18 * E * E * E * E
                    - 1.4253e-12 * E * E * E
                    + 1.2940e-7 * E * E
                    - 5.2921e-3 * E
                    + 83.3216;
            E -= E * (bias / 100);
        }
        return (long) E;
    }
}
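To make the packed layout in the diagram easier to follow, here is a small standalone sketch that uses the same bit arithmetic as setRegister and getRegister on only four registers. The class PackedRegisters is a hypothetical helper written for this illustration, not part of the implementation above.

```java
final class PackedRegisters {
    static final int BITS = 6;
    static final int MASK = (1 << BITS) - 1;

    final byte[] bytes;

    PackedRegisters(int count) {
        // One spare byte so that bytes[b + 1] is always in bounds.
        bytes = new byte[(count * BITS + 7) / 8 + 1];
    }

    void set(int index, int val) {
        int b = index * BITS / 8;   // byte holding the low bits
        int fb = index * BITS & 7;  // bit offset inside that byte
        int fb8 = 8 - fb;
        bytes[b] &= ~(MASK << fb);
        bytes[b] |= val << fb;
        bytes[b + 1] &= ~(MASK >> fb8);
        bytes[b + 1] |= val >> fb8;
    }

    int get(int index) {
        int b = index * BITS / 8;
        int fb = index * BITS & 7;
        int fb8 = 8 - fb;
        int b0 = bytes[b] & 0xff;
        int b1 = bytes[b + 1] & 0xff;
        return ((b0 >> fb) | (b1 << fb8)) & MASK;
    }

    public static void main(String[] args) {
        PackedRegisters r = new PackedRegisters(4);
        r.set(0, 63); // all six bits set
        r.set(2, 63);
        // Register 2 straddles bytes 1 and 2: its low four bits land in the
        // high half of byte 1, its high two bits in the low end of byte 2.
        System.out.println(String.format("%02X %02X %02X",
                r.bytes[0] & 0xff, r.bytes[1] & 0xff, r.bytes[2] & 0xff)); // 3F F0 03
        System.out.println(r.get(0) + " " + r.get(1) + " " + r.get(2) + " " + r.get(3)); // 63 0 63 0
    }
}
```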

Test

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class Test {
    // Compare the HyperLogLog estimate with an exact HashSet count
    // for n random integers.
    public static void testHyperLogLog(int n) {
        System.out.println("n = " + n);
        HyperLogLog hyperLogLog = new HyperLogLog();
        Set<Integer> s = new HashSet<>();
        Random random = new Random();
        for (int i = 0; i < n; i++) {
            int number = random.nextInt();
            hyperLogLog.hllAdd(number);
            s.add(number);
        }
        System.out.println("hyperLogLog count = " + hyperLogLog.hllCount());
        System.out.println("hashset count = " + s.size());
        System.out.println("error rate = "
                + Math.abs((double) hyperLogLog.hllCount() / s.size() - 1));
    }

    public static void main(String[] args) {
        // Run with n = 10, 100, ..., 10^9.
        int n = 1;
        for (int i = 0; i < 9; i++) {
            n *= 10;
            testHyperLogLog(n);
        }
    }
}

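5. Test results

In the output, n is the total number of randomly generated elements, hyperLogLog count is the cardinality estimated by HyperLogLog, hashset count is the exact result obtained with a HashSet, and error rate is the relative error of the estimate.

In these runs the HyperLogLog error rate stays below 1%. When the number of elements reaches 100 million, the HashSet used for the exact count fails with an OutOfMemoryError.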

n = 10
hyperLogLog count = 10
hashset count = 10
error rate = 0.0
n = 100
hyperLogLog count = 100
hashset count = 100
error rate = 0.0
n = 1000
hyperLogLog count = 1002
hashset count = 1000
error rate = 0.0020000000000000018
n = 10000
hyperLogLog count = 9974
hashset count = 10000
error rate = 0.0026000000000000467
n = 100000
hyperLogLog count = 100721
hashset count = 99999
error rate = 0.007220072200722072
n = 1000000
hyperLogLog count = 990325
hashset count = 999883
error rate = 0.00955911841685475
n = 10000000
hyperLogLog count = 9966476
hashset count = 9988334
error rate = 0.002188352932531057
n = 100000000
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.HashMap.resize(HashMap.java:703)
    at java.util.HashMap.putVal(HashMap.java:662)
    at java.util.HashMap.put(HashMap.java:611)
    at java.util.HashSet.add(HashSet.java:219)
    at com.sankuai.alg.Test.testHyperLogLog(Test.java:24)
    at com.sankuai.alg.Test.main(Test.java:36)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Process finished with exit code 1
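These error rates are in line with the commonly cited standard error of HyperLogLog, roughly 1.04 / sqrt(m) for m registers. The sketch below only evaluates that expression for the register count used in this article. The class name HllError is illustrative, and the formula is quoted from the HyperLogLog literature, not derived here.

```java
import java.util.Locale;

final class HllError {
    // Approximate relative standard error for m registers.
    static double standardError(int registers) {
        return 1.04 / Math.sqrt(registers);
    }

    public static void main(String[] args) {
        double se = standardError(1 << 14); // 16384 registers
        // About 0.81% for m = 16384.
        System.out.println(String.format(Locale.ROOT, "%.2f%%", se * 100));
    }
}
```

A standard error is not a hard bound, so an individual run can land somewhat above it, as the n = 1000000 run does.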

