1. Basic Concepts
Cardinality is the number of distinct elements in a set. Take the collection {1,2,3,4,5,2,3,9,7}: it lists 9 entries, but 2 and 3 each appear twice,
so the distinct elements are 1, 2, 3, 4, 5, 9, 7 and the cardinality of the collection is 7.
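As a concrete illustration (a minimal sketch added here, not code from the original article), the exact cardinality of a small collection can be computed by dropping the elements into a HashSet and taking its size:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CardinalityExample {
    public static void main(String[] args) {
        // 9 entries in the input, but only 7 distinct values
        Set<Integer> distinct = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 2, 3, 9, 7));
        System.out.println(distinct.size()); // prints 7
    }
}

This exact approach is precisely what HyperLogLog avoids: it trades a small, bounded error for a fixed, tiny memory footprint.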
Redis added the HyperLogLog structure in version 2.8.9. HyperLogLog is an algorithm for cardinality estimation; its advantage is that no matter how many
elements are fed in, or how large they are, the space needed to compute the cardinality stays fixed and very small. In Redis, each HyperLogLog key costs only
12 KB of memory to count the cardinality of close to 2^64 distinct elements. This stands in sharp contrast to a set, whose memory consumption grows with the
number of elements. However, because HyperLogLog computes the cardinality from the input elements without storing the elements themselves, it cannot return
the individual input elements the way a set can.
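For reference, this is roughly how the Redis structure is driven from Java. The sketch below assumes the Jedis client and a Redis server on localhost, neither of which appears in the original article; PFADD feeds elements in and PFCOUNT reads back the approximate cardinality.

import redis.clients.jedis.Jedis;

public class RedisHllExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // duplicates do not increase the count
            jedis.pfadd("visitors", "user1", "user2", "user3", "user2");
            // prints an estimate of 3; the individual elements cannot be retrieved
            System.out.println(jedis.pfcount("visitors"));
        }
    }
}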
HyperLogLog works by applying a hash function h to the elements of an input data stream M, producing observables S = h(M) that behave like uniformly
distributed bit strings over {0,1}^∞. The hashed stream is split into m substreams, and one observable is maintained for each substream (each substream is
effectively its own small LogLog estimator). Averaging these m observables yields an estimate whose accuracy improves as m grows, and this requires only a
few operations per element of the input set.
2. Algorithm Architecture
3. Algorithm Derivation and Proof
The HyperLogLog algorithm rests on some fairly involved probability and statistics; readers who are interested can consult the papers below.
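As a brief sketch of the headline result those papers establish (stated here for orientation, using m for the number of registers and M[j] for the value stored in register j, the same quantities used by the implementation that follows), the raw HyperLogLog estimate is

    E = \alpha_m \, m^2 \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1},
    \qquad
    \alpha_m = \frac{0.7213}{1 + 1.079/m} \quad (m \ge 128)

with the linear-counting correction E = m \ln(m / V) applied when E < 2.5m and V > 0 registers are still zero. The relative standard error of the estimator is about 1.04 / \sqrt{m}, which for m = 2^14 registers is roughly 0.81%.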
4. Java Implementation
The implementation below follows the Redis source code (hyperloglog.c), ported to Java.
MurmurHash is used to hash the input elements and produce uniformly distributed 64-bit hash values.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class MurmurHash {

    /**
     * MurmurHash64 implementation
     */
    public static long hash64(byte[] key) {
        ByteBuffer buf = ByteBuffer.wrap(key);
        int seed = 0x1234ABCD;

        ByteOrder byteOrder = buf.order();
        buf.order(ByteOrder.LITTLE_ENDIAN);

        long m = 0xc6a4a7935bd1e995L;
        int r = 47;

        long h = seed ^ (buf.remaining() * m);

        long k;
        while (buf.remaining() >= 8) {
            k = buf.getLong();

            k *= m;
            k ^= k >>> r;
            k *= m;

            h ^= k;
            h *= m;
        }

        if (buf.remaining() > 0) {
            ByteBuffer finish = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
            // for big-endian version, do this first:
            // finish.position(8-buf.remaining());
            finish.put(buf).rewind();
            h ^= finish.getLong();
            h *= m;
        }

        h ^= h >>> r;
        h *= m;
        h ^= h >>> r;

        buf.order(byteOrder);
        return h;
    }
}
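A quick look at how the HyperLogLog class below consumes this hash (the constant 14 is HLL_P from that class): the top 14 bits pick a register, and the remaining bits supply the run of zeros that gets counted.

public class MurmurHashDemo {
    public static void main(String[] args) {
        long h = MurmurHash.hash64("hello".getBytes());
        // the top HLL_P = 14 bits select one of the 2^14 registers, exactly as hllAdd does below
        long registerIndex = h >>> (64 - 14);
        System.out.println("hash = " + h + ", register = " + registerIndex);
    }
}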
HyperLogLog implementation
public class HyperLogLog {

    private static final int HLL_P = 14; // number of bits of the 64-bit hash used as the register (group) index; more registers means lower error but more memory
    private static final int HLL_REGISTERS = 1 << HLL_P; // total number of registers
    private static final int HLL_BITS = 6; // bits needed to store each register's maximum run-of-zeros count
    private static final int HLL_REGISTER_MASK = (1 << HLL_BITS) - 1; // 6-bit mask for a register value

    /**
     * Register storage layout: little-endian, the least significant bits are stored first,
     * followed by the most significant bits.
     * +--------+--------+--------+------// //--+
     * |11000000|22221111|33333322|55444444 .... |
     * +--------+--------+--------+------// //--+
     */
    private byte[] registers;

    public HyperLogLog() {
        // 12288 + 1 bytes (12 KB); the extra trailing byte acts as a terminator and has no real use
        registers = new byte[(HLL_REGISTERS * HLL_BITS + 7) / 8 + 1];
    }

    // alpha constant, taken from the referenced paper
    private double alpha(int m) {
        switch (m) {
            case 16:
                return 0.673;
            case 32:
                return 0.697;
            case 64:
                return 0.709;
            default:
                return 0.7213 / (1 + 1.079 / m);
        }
    }

    // store val as the value of register index
    private void setRegister(int index, int val) {
        int _byte = index * HLL_BITS / 8;
        int _fb = index * HLL_BITS & 7;
        int _fb8 = 8 - _fb;
        registers[_byte] &= ~(HLL_REGISTER_MASK << _fb);
        registers[_byte] |= val << _fb;
        registers[_byte + 1] &= ~(HLL_REGISTER_MASK >> _fb8);
        registers[_byte + 1] |= val >> _fb8;
    }

    // read the value of register index
    private int getRegister(int index) {
        int _byte = index * HLL_BITS / 8;
        int _fb = index * HLL_BITS & 7;
        int _fb8 = 8 - _fb;
        int b0 = registers[_byte] & 0xff;
        int b1 = registers[_byte + 1] & 0xff;
        return ((b0 >> _fb) | (b1 << _fb8)) & HLL_REGISTER_MASK;
    }

    public int hllAdd(int number) {
        long hash = MurmurHash.hash64(Integer.toString(number).getBytes());
        long index = hash >>> (64 - HLL_P);
        int oldcount = getRegister((int) index);
        // count the run of consecutive 0 bits that starts right after the HLL_P index bits, plus one for the terminating 1
        hash |= 1L;
        long bit = 1L << (63 - HLL_P);
        int count = 1;
        while ((hash & bit) == 0L) {
            count++;
            bit >>= 1L;
        }
        if (count > oldcount) {
            setRegister((int) index, count);
            return 1;
        } else {
            return 0;
        }
    }

    // estimate the cardinality
    public long hllCount() {
        // harmonic-mean accumulation of the register values: SUM(2^-reg)
        double E = 0;
        int ez = 0;
        double m = HLL_REGISTERS;
        for (int i = 0; i < HLL_REGISTERS; i++) {
            int reg = getRegister(i);
            if (reg == 0) {
                ez++;
            } else {
                E += 1.0d / (1L << reg);
            }
        }
        E += ez;
        E = 1 / E * alpha((int) m) * m * m;
        if (E < m * 2.5 && ez != 0) {
            // small-range correction: linear counting over the zero registers
            E = m * Math.log(m / ez);
        } else if (m == 16384 && E < 72000) {
            // bias-correction polynomial taken from the Redis source code
            double bias = 5.9119e-18 * E * E * E * E
                    - 1.4253e-12 * E * E * E
                    + 1.2940e-7 * E * E
                    - 5.2921e-3 * E
                    + 83.3216;
            E -= E * (bias / 100);
        }
        return (long) E;
    }
}
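As a quick sanity check on the constants above (a back-of-the-envelope calculation added here), the register array works out to the 12 KB figure quoted earlier:

    HLL_REGISTERS × HLL_BITS = 2^14 × 6 bits = 98,304 bits = 12,288 bytes = 12 KB

plus the single unused trailing byte allocated in the constructor.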
Test
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class Test {

    // test with a collection of n random elements
    public static void testHyperLogLog(int n) {
        System.out.println("n = " + n);
        HyperLogLog hyperLogLog = new HyperLogLog();
        Set<Integer> s = new HashSet<>();
        Random random = new Random();
        for (int i = 0; i < n; i++) {
            int number = random.nextInt();
            hyperLogLog.hllAdd(number);
            s.add(number);
        }
        System.out.println("hyperLogLog count = " + hyperLogLog.hllCount());
        System.out.println("hashset count = " + s.size());
        System.out.println("error rate = "
                + Math.abs((double) hyperLogLog.hllCount() / s.size() - 1));
    }

    public static void main(String[] args) {
        int n = 1;
        for (int i = 0; i < 9; i++) {
            n *= 10;
            testHyperLogLog(n);
        }
    }
}
5. Test Results
Here n is the total number of randomly generated elements, hyperLogLog count is the cardinality estimated by the HyperLogLog algorithm, hashset count is the exact result obtained with a HashSet, and error rate is the relative error between the two.
In most cases the error rate of the HyperLogLog algorithm stays within 1%; once the total number of elements reaches 100 million, the HashSet run fails with an OutOfMemoryError.
n = 10
hyperLogLog count = 10
hashset count = 10
error rate = 0.0
n = 100
hyperLogLog count = 100
hashset count = 100
error rate = 0.0
n = 1000
hyperLogLog count = 1002
hashset count = 1000
error rate = 0.0020000000000000018
n = 10000
hyperLogLog count = 9974
hashset count = 10000
error rate = 0.0026000000000000467
n = 100000
hyperLogLog count = 100721
hashset count = 99999
error rate = 0.007220072200722072
n = 1000000
hyperLogLog count = 990325
hashset count = 999883
error rate = 0.00955911841685475
n = 10000000
hyperLogLog count = 9966476
hashset count = 9988334
error rate = 0.002188352932531057
n = 100000000
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.util.HashMap.resize(HashMap.java:703)
	at java.util.HashMap.putVal(HashMap.java:662)
	at java.util.HashMap.put(HashMap.java:611)
	at java.util.HashSet.add(HashSet.java:219)
	at com.sankuai.alg.Test.testHyperLogLog(Test.java:24)
	at com.sankuai.alg.Test.main(Test.java:36)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Process finished with exit code 1