The HyperLogLog algorithm: counting distinct elements in very little space


Big Data Counting: How to Count a Billion Distinct Objects Using Only 1.5 KB of Memory

This is a guest post by Matt Abrams (@abramsm) of Clearspring, discussing how they are able to accurately estimate the cardinality of sets with billions of distinct elements using surprisingly small data structures. Their servers receive billions of events per month.

At Clearspring, we work with a lot of statistical data. Counting the number of distinct elements in a set is a real challenge when the set is large.

To better understand the challenge of determining the cardinality of large datasets, imagine that your log file contains 16-character IDs and you want to count the number of distinct IDs. For example:

4f67bfc603106cb2

These 16 characters require 128 bits to represent, so 65,000 IDs take about 1 MB of space. We receive more than 3 billion event records every day, each with an ID; those IDs alone require 384 billion bits, or about 45 GB of storage per day. And that is just the space the ID field needs. The easiest way to extract the distinct IDs from the daily event records is to keep a hash set in memory containing one copy of each unique ID (if the same ID appears in multiple records, only one copy is stored in the set). Even if we assume that only 1/3 of the record IDs are unique (that is, 2/3 are duplicates), the hash set would still require 119 GB of RAM, not counting the overhead Java needs to store objects in memory. You would need a machine with hundreds of gigabytes of RAM just to count the distinct IDs in a single day's log events, and counting over weeks or months only makes the problem harder. Since we don't have a spare machine with hundreds of gigabytes of RAM lying around, we need a better solution.
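
For reference, the naive hash-set approach is only a few lines (a minimal Java sketch, assuming the IDs arrive one per line on standard input; the class name is ours):

import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

public class NaiveDistinctCount {
    public static void main(String[] args) {
        // Every unique ID is stored in full -- this is what blows up the memory.
        Set<String> seen = new HashSet<>();
        Scanner in = new Scanner(System.in);
        while (in.hasNextLine()) {
            seen.add(in.nextLine().trim());
        }
        System.out.println("distinct IDs: " + seen.size());
    }
}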

A common way to attack this problem is to use a bitmap (see the blog post "A massive data processing algorithm: Bit-map"). Bitmaps can quickly and accurately compute the cardinality of a given input. The basic idea is to map the dataset onto bit positions using a hash function, so that each input element corresponds to one bit; assuming the hash produces no collisions, the space needed per element drops to a single bit. Although bitmaps save a lot of storage, they still run into trouble when counting very high cardinalities or very many distinct sets. For example, to count on the order of a billion distinct values, a bitmap needs on the order of a billion bits, or roughly 120 MB per counter. Sparse bitmaps can be compressed for better space efficiency, but that doesn't always help.
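
A minimal Java sketch of the bitmap idea (our own illustration, not code from the article; it reduces the element's hash modulo the bitmap size, which is exactly where collisions, and thus undercounting, can creep in):

import java.util.BitSet;

public class BitmapCounter {
    private final BitSet bits;
    private final int size;

    public BitmapCounter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    public void add(String element) {
        // Map the element to one bit position; distinct elements that
        // collide on the same position are counted only once (an error).
        int pos = Math.floorMod(element.hashCode(), size);
        bits.set(pos);
    }

    public int count() {
        return bits.cardinality(); // number of bits set to 1
    }
}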

Fortunately, cardinality estimation is an active research area. We have drawn on that research to provide an open-source implementation of cardinality estimation, set membership detection, and top-k algorithms.

Cardinality estimation algorithms trade accuracy for space. To illustrate, we counted the number of distinct words in all of Shakespeare's works using three different counting techniques (note that our input dataset contains extra data, so the cardinality is higher than the standard reference answer to this question). The three techniques were: a Java HashSet, a linear probabilistic counter, and a HyperLogLog counter.

The comparison showed that we could count the words using only a few hundred bytes, with an error under 3%. By contrast, the HashSet had the highest accuracy but required nearly 10 MB of space. It is easy to see why cardinality estimators are useful: in most web-scale counting applications, perfect accuracy is not that important, and a probabilistic counter yields enormous space savings.

Linear probabilistic counter

The linear probabilistic counter is space-efficient and lets the implementer specify the desired level of precision. It is useful when space matters but you still need to control the error in the result. The algorithm runs in two steps. First, it allocates a bitmap in memory, initialized to all zeros, and hashes each entry in the input data; the hash maps each record (or element) to a bit position in the bitmap, and that bit is set to 1. Second, the algorithm counts the number of bits that are still 0 and estimates the cardinality by plugging that count into the following formula:

n = -m * ln(Vn)

Note: ln(Vn) = log_e(Vn), the natural logarithm.

In the formula, m is the size of the bitmap, and Vn is the fraction of bits in the map that are still empty (zero). Importantly, the bitmap can be made much smaller than the expected maximum cardinality; how much smaller depends on how much error you can tolerate. Because the bitmap size m is smaller than the total number of distinct elements, collisions will occur. The collisions are what save the space, but they also introduce error into the estimate. By controlling the size of the original map, we can estimate the number of collisions, and therefore how much error to expect in the final result.
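
Here is a minimal Java sketch of the two steps above (our own illustration; the class name and the use of String.hashCode() as the hash function are simplifications):

import java.util.BitSet;

public class LinearCounter {
    private final BitSet map;
    private final int m; // bitmap size; may be much smaller than the expected cardinality

    public LinearCounter(int m) {
        this.m = m;
        this.map = new BitSet(m);
    }

    // Step 1: hash each element to a bit position and set that bit to 1.
    public void add(String element) {
        map.set(Math.floorMod(element.hashCode(), m));
    }

    // Step 2: estimate n = -m * ln(Vn), where Vn is the fraction of zero bits.
    // The estimate is only valid while some bits remain 0, i.e. the map is not full.
    public double estimate() {
        double vn = (double) (m - map.cardinality()) / m;
        return -m * Math.log(vn);
    }
}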

HyperLogLog

As its name suggests, the HyperLogLog counter can estimate the cardinality of a set with cardinality up to Nmax using only log log(Nmax) + O(1) bits. Like a linear counter, a HyperLogLog counter lets the designer specify the desired precision; for HyperLogLog, this is done by choosing the desired relative standard deviation and the maximum cardinality expected to be counted. Most counters work by taking an input stream M and applying a hash function to it, H(M); this yields an observable S = H(M) over {0,1}^∞, an infinite binary string. HyperLogLog splits the hashed input into m substreams and maintains an observable for each substream, each of which is in effect its own small HyperLogLog. Averaging these observables yields a counter whose precision improves as m grows and which needs only a few operations per element of the input set. The result is that such a counter can count one billion distinct elements with 2% accuracy using only 1.5 KB of space, far more efficient than the roughly 120 MB a bitmap would need.
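
As a back-of-the-envelope check on that 1.5 KB figure (our own calculation, not from the original article): HyperLogLog's relative standard error is roughly 1.04 / sqrt(m), so a 2% error needs about m = (1.04 / 0.02)^2 ≈ 2704 registers, and each register only needs about 5 bits to store the largest first-1 position it has seen, giving on the order of 2704 * 5 / 8 ≈ 1.7 KB, consistent with the 1.5 KB quoted.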

Merging distributed counters

We have shown that the counters described above can estimate the cardinality of very large sets. But what if your raw input doesn't fit on a single machine? That is exactly the problem we face at Clearspring: our data is spread across hundreds of servers, each holding only a subset of the full dataset. Being able to merge the contents of a set of distributed counters is essential. The idea sounds confusing at first, but if you think about it for a moment, it is not much different from basic cardinality estimation. Because the counters represent the cardinality as a set of bits in a map, we can take two compatible counters and merge their bits into a single map. The algorithms already handle collisions, so we still get a cardinality estimate with the required precision, even though we never brought all the input data onto one machine. This is extremely useful and saves us a great deal of time and effort moving data around the network.
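
For HyperLogLog specifically, two counters built with the same parameters merge by taking a register-wise maximum; the result is identical to the counter you would have gotten by feeding both input streams into one machine. A minimal Java sketch (the class and method names are ours):

public class HllMerge {
    // Merge two compatible HyperLogLog register arrays (same register count m).
    public static int[] merge(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("counters are not compatible");
        }
        int[] merged = new int[a.length];
        for (int j = 0; j < a.length; j++) {
            merged[j] = Math.max(a[j], b[j]); // keep the larger observation
        }
        return merged;
    }
}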

Next Steps

Hopefully this article helps you better understand probabilistic counters and their applications. If estimating the cardinality of large sets is a problem for you and you happen to use a JVM-based language, try the stream-lib project: it provides implementations of the algorithms described above, plus several other stream-processing utilities.
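
For example, counting with stream-lib looks roughly like the sketch below (this assumes stream-lib's HyperLogLog class and its offer/cardinality API; check the project's documentation for the exact constructors available in your version):

import com.clearspring.analytics.stream.cardinality.HyperLogLog;

public class StreamLibExample {
    public static void main(String[] args) {
        // log2m = 12 gives 2^12 = 4096 registers; larger values give more precision.
        HyperLogLog hll = new HyperLogLog(12);
        hll.offer("4f67bfc603106cb2");
        hll.offer("4f67bfc603106cb2");   // duplicates do not change the estimate
        hll.offer("6a22e47c73561dba");   // a second, made-up ID
        System.out.println("estimated cardinality: " + hll.cardinality());
    }
}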

This article comes from High Scalability.

For a deeper understanding of HyperLogLog, see:

http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
http://www.ic.unicamp.br/~celio/peer2peer/math/bitmap-algorithms/durand03loglog.pdf


Article Source: http://blog.csdn.net/hguisu/article/details/8433731

=================================================================================================

An Elementary Introduction to the HyperLogLog Algorithm

The purpose of the algorithm:

Suppose you are given an array: int a[] = {1,1,2,6,9,8,5,4,1,2}

The array has 10 elements, of which 7 are distinct: 1, 2, 4, 5, 6, 8, 9.

The algorithm determines how many distinct elements there are in an input stream.

It is a probabilistic algorithm, but its accuracy is very high. Its description and implementation details follow.

We first need the following auxiliary functions and data:

1. int hash(type input); — hashes an input element to a 32-bit integer. The input may be an integer, a string, or even a struct.

2. unsigned int position(int input); — returns the position of the first 1, reading the binary representation of input from left to right.

For example:

position(1000000100000111110) = 1

position(0001111000011100000) = 4

position(000000) = 7 (when no 1 appears, the result is the length of the string plus 1)

3. m = 2^b, where b is an integer in [4, 16].

4. Several constants:

const double a16 = 0.673;
const double a32 = 0.697;
const double a64 = 0.709;
const double am = 0.7213 / (1 + 1.079 / m);   // for m >= 128

With these four pieces in place, we can start counting with HyperLogLog.

Allocate m = 2^b registers M[1] ... M[m], initialized to 0.

for (v in input)
{
    x = hash(v);
    j = 1 + <x1 x2 ... xb>;          // register index: the first b bits of x, read as a binary number
    w = x(b+1) x(b+2) ... x32;       // the remaining 32 - b bits of x
    M[j] = max(M[j], position(w));   // keep the largest first-1 position seen by this register
}

Res = am * m^2 * ( sum over j = 1..m of 2^(-M[j]) )^(-1)
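
Putting the pieces together, here is a small runnable Java version of the algorithm above (our own sketch: it uses b = 12, a simple bit-mixing function stands in for hash(), and Integer.numberOfLeadingZeros implements position(); a production implementation would also need the small- and large-range corrections from the original paper):

public class SimpleHyperLogLog {
    private static final int B = 12;          // m = 2^12 = 4096 registers
    private static final int M = 1 << B;
    // am = 0.7213 / (1 + 1.079 / m), valid for m >= 128
    private static final double AM = 0.7213 / (1 + 1.079 / M);
    private final int[] registers = new int[M];

    public void add(Object v) {
        int x = hash(v);
        int j = x >>> (32 - B);                        // first b bits -> register index
        int w = x << B;                                // remaining 32 - b bits, left-aligned
        int pos = Integer.numberOfLeadingZeros(w) + 1; // position of the first 1 bit
        registers[j] = Math.max(registers[j], pos);
    }

    public double estimate() {
        double sum = 0.0;
        for (int mj : registers) {
            sum += Math.pow(2.0, -mj);
        }
        return AM * M * (double) M / sum;              // Res = am * m^2 * (sum 2^-M[j])^-1
    }

    private static int hash(Object v) {
        // Stand-in 32-bit hash; a real implementation should use a
        // stronger hash such as MurmurHash.
        int h = v.hashCode();
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return h;
    }

    public static void main(String[] args) {
        SimpleHyperLogLog hll = new SimpleHyperLogLog();
        for (int i = 0; i < 1000000; i++) {
            hll.add("element-" + (i % 200000));        // 200,000 distinct elements
        }
        System.out.println("estimate: " + hll.estimate()); // should be close to 200000
    }
}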

Source: http://www.java123.net/v/356202.html


=================================================================================================

Redis data structure: HyperLogLog



Suppose we want to implement a feature that records the number of unique IP addresses visiting a site each day.


Set implementation:

Store the IP of each visitor in a set. Because every element in a set is distinct, duplicate IPs are stored only once, and the number of unique IPs can then be obtained with the SCARD command.

For example, a program could record the IP of each visitor to the site on August 15, 2014 with:

ip = get_visitor_ip()
SADD '2014.8.15::unique::ip' ip

and then get that day's unique IP count with:

SCARD '2014.8.15::unique::ip'


Problems with the set implementation

Storing each IPv4 address as a string takes up to 15 bytes (in the form 'XXX.XXX.XXX.XXX', for example '202.189.128.186').

The following table shows the memory required when a set is used to record different numbers of unique IPs:

Unique IPs     One day   One month   One year
1,000,000      15 MB     450 MB      5.4 GB
10,000,000     150 MB    4.5 GB      54 GB
100,000,000    1.5 GB    45 GB       540 GB

The more IPs the set records, the more memory it consumes. And if IPv6 addresses also need to be stored, even more memory will be required.


To better solve problems like unique IP counting, Redis added the HyperLogLog structure in version 2.8.9.


HyperLogLog introduction

A HyperLogLog accepts multiple elements as input and gives a cardinality estimate for those elements:

• Cardinality: the number of distinct elements in a set. For example, the cardinality of {'apple', 'banana', 'cherry', 'banana', 'apple'} is 3.
• Estimate: the cardinality given by the algorithm is not exact; it may be slightly higher or lower than the true value, but it stays within a reasonable error range.

The advantage of HyperLogLog is that even if the number or volume of input elements is very large, the space required to compute the cardinality is fixed, and very small. In Redis, each HyperLogLog key needs only about 12 KB of memory to count the cardinality of close to 2^64 distinct elements. This is in stark contrast to a set, which consumes more and more memory as the count grows.

However, because HyperLogLog only computes the cardinality from the input elements and does not store the elements themselves, it cannot, the way a set can, return the individual elements that were added.


Adding elements to a HyperLogLog

PFADD key element [element ...]

Adds any number of elements to the specified HyperLogLog.

Executing this command may modify the HyperLogLog to reflect a new cardinality estimate. If the HyperLogLog's cardinality estimate changed as a result of the command, PFADD returns 1; otherwise it returns 0.

The complexity of the command is O(N), where N is the number of elements added.

Getting the cardinality estimate of a HyperLogLog

PFCOUNT key [key ...]

When given a single HyperLogLog, the command returns its cardinality estimate.

When given multiple HyperLogLogs, the command first computes their union, obtaining a merged HyperLogLog, and then returns the merged HyperLogLog's cardinality estimate as the result (the merged HyperLogLog is not stored and is discarded after use).

When acting on a single HyperLogLog, the complexity is O(1), with a very low average constant time. When acting on multiple HyperLogLogs, the complexity is O(N), and the constant time is also much larger than in the single-key case.

Example of PFADD and PFCOUNT

redis> PFADD unique::ip::counter '192.168.0.1'
(integer) 1
redis> PFADD unique::ip::counter '127.0.0.1'
(integer) 1
redis> PFADD unique::ip::counter '255.255.255.255'
(integer) 1
redis> PFCOUNT unique::ip::counter
(integer) 3


Merging multiple HyperLogLogs

PFMERGE destkey sourcekey [sourcekey ...]

Merges multiple HyperLogLogs into a single HyperLogLog whose cardinality estimate is close to the cardinality of the union of all the given HyperLogLogs.

The complexity of the command is O(N), where N is the number of HyperLogLogs being merged, and the command has a fairly high constant time.

Example of PFMERGE

redis> PFADD str1 "apple" "banana" "cherry"
(integer) 1
redis> PFCOUNT str1
(integer) 3
redis> PFADD str2 "apple" "cherry" "durian" "mango"
(integer) 1
redis> PFCOUNT str2
(integer) 4
redis> PFMERGE str1&2 str1 str2
OK
redis> PFCOUNT str1&2
(integer) 5

Using HyperLogLog to count unique IPs

The following table lists the memory consumed when HyperLogLog is used to record different numbers of unique IPs (the last column repeats the set implementation's one-year figures for comparison):

Unique IPs     One day   One month   One year   One year (set)
1,000,000      12 KB     360 KB      4.32 MB    5.4 GB
10,000,000     12 KB     360 KB      4.32 MB    54 GB
100,000,000    12 KB     360 KB      4.32 MB    540 GB

As the table shows, to count the same number of unique IPs, HyperLogLog needs far less memory than a set.
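
To convert the earlier set-based code, only the commands change; a sketch reusing the article's hypothetical get_visitor_ip() helper and date-stamped key:

ip = get_visitor_ip()
PFADD '2014.8.15::unique::ip' ip

The day's unique IP estimate is then:

PFCOUNT '2014.8.15::unique::ip'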

Source: http://www.cnblogs.com/ysuzhaixuefei/p/4052110.html
