A tutorial on making a simple naïve cardinality estimator with Python

Suppose you have a very, very large data set, far too large to hold in memory, and it contains duplicate entries. You want to know how many duplicates there are, but the data is not sorted, and with this much data sorting is impractical. How do you estimate how many distinct values the data set contains? This is useful in many applications, for example in a database's query planner: the best query plan depends not only on how much data there is in total, but also on how many distinct values it contains.

Before you continue reading, pause and think about how you would solve this yourself, because the algorithm we discuss today is very simple yet remarkably creative, and it is not easy to come up with.
A simple naïve cardinality estimator

Let's start with a simple example. Suppose someone generates a data set in the following way:

    • Generate n evenly distributed random numbers
    • Arbitrarily pick some of those numbers and repeat them an arbitrary number of times
    • Shuffle the resulting numbers

How do we estimate the number of distinct numbers in the resulting data set? Knowing that the original numbers were random and evenly distributed, a very simple approach is to find the smallest value. If the maximum possible value is m and the smallest value we see is x, we can estimate that there are roughly m/x distinct numbers in the data set. For example, if we scan a data set of numbers between 0 and 1 and find that the smallest value is 0.01, we have reason to believe there are roughly 100 distinct numbers in it; if we find an even smaller minimum, the number of distinct values is probably higher. Note that it does not matter how many times each number is repeated: repetitions do not affect the value of min.
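
To make this concrete, here is a minimal sketch of the idea (my own illustration, not an algorithm from any of the papers discussed below); generate_data and estimate_by_minimum are hypothetical names, and values are drawn uniformly from [0, 1), so m = 1:

import random

def generate_data(n, copies=3):
    """Follow the recipe above: n uniform random numbers in [0, 1),
    some of them repeated an arbitrary number of times, then shuffled."""
    originals = [random.random() for _ in range(n)]
    data = originals + [random.choice(originals) for _ in range(copies * n)]
    random.shuffle(data)
    return data

def estimate_by_minimum(values, max_value=1.0):
    """Estimate the number of distinct values as max_value / min(values)."""
    return max_value / min(values)

print(estimate_by_minimum(generate_data(1000)))  # roughly 1000, but with huge variance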

The advantage of this process is that it is very intuitive, but it is also very imprecise. It is not hard to construct a counterexample: a data set containing only a few distinct numbers might happen to have a very small minimum, while a data set with many distinct numbers could have a much larger minimum than we expect, and in either case the estimate is badly off. Finally, few data sets are so well dispersed and sufficiently random. Still, this prototype gives us some inspiration; to reach our goal we will need a more refined algorithm.
Probability-based counting

The first improvement comes from Flajolet and Martin's paper Probabilistic Counting Algorithms for Data Base Applications, with further improvements in Durand and Flajolet's paper LogLog Counting of Large Cardinalities and Flajolet et al's paper HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. It is interesting to watch how the ideas arise and improve from one paper to the next, but my approach here is slightly different: I will demonstrate how to build and improve a solution from scratch, omitting some of the algorithms from the original papers. Interested readers can read all three; they contain a lot of mathematical detail that I will not cover here.

First, Flajolet and Martin observed that for any data set, we can always find a good hash function such that the hashed values have essentially whatever distribution we need, including behaving like evenly distributed (pseudo)random numbers. With this simple insight we can turn an arbitrary data set into the kind of evenly distributed data set of our earlier example, but that alone is far from enough.

Next, they found a better way to estimate the number of distinct values than recording the smallest hash value. The method Flajolet and Martin used is to count the number of 0 bits at the beginning of each hashed value. Clearly, in a random data set, a run of k zero bits occurs on average once for every 2^k elements, so all we have to do is look for these runs and record the length of the longest one to estimate the number of distinct elements. However, this is still not a great estimator: it can only give us estimates that are powers of 2, and, unlike the minimum-value method, its variance is very high. On the other hand, the estimate requires very little space: to record runs of leading zeroes up to 32 bits long, we only need a 5-bit number.
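
As a rough single-register sketch of this idea (again my own illustration, with hypothetical helper names; the hash is masked to 32 bits purely for demonstration):

def leading_zeroes_32(h):
    """Number of leading 0 bits in a 32-bit value (32 if h == 0)."""
    return 32 - h.bit_length()

def crude_estimate(values):
    """Single-register estimate: 2 ** (longest run of leading zeroes observed).
    A run of k leading zeroes appears about once every 2**k hashed values,
    so the longest run seen is a (noisy, power-of-two) cardinality estimate."""
    max_run = 0
    for value in values:
        h = hash(value) & 0xFFFFFFFF  # keep 32 bits for illustration
        max_run = max(max_run, leading_zeroes_32(h))
    return 2 ** max_run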

Note: the original Flajolet-Martin paper goes on to describe a bitmap-based procedure that obtains a more accurate estimate. I will not discuss that detail here, because it is soon superseded by the improvement in the next method; interested readers can consult the original paper.

So now we have a genuinely rather poor bit-pattern estimator. What improvements can we make? One straightforward idea is to use multiple independent hash functions. If each hash function produces its own random data set, we can record the longest run of leading zeroes for each of them and average the results at the end for a more accurate estimate.
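
A sketch of that multi-hash approach, assuming we fake independent hash functions by salting a single real hash (SHA-1 from hashlib here); the names are hypothetical and no bias correction is applied:

import hashlib

def leading_zeroes_32(h):
    """Number of leading 0 bits in a 32-bit value (32 if h == 0)."""
    return 32 - h.bit_length()

def salted_hash_32(value, salt):
    """Simulate independent hash functions by salting one real hash."""
    digest = hashlib.sha1(f"{salt}:{value}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def estimate_with_many_hashes(values, num_hashes=16):
    """Track the longest leading-zero run per 'hash function', then average.
    Every element is hashed num_hashes times, which is exactly the cost
    problem that stochastic averaging (described next) avoids."""
    max_runs = [0] * num_hashes
    for value in values:
        for salt in range(num_hashes):
            h = salted_hash_32(value, salt)
            max_runs[salt] = max(max_runs[salt], leading_zeroes_32(h))
    return 2 ** (sum(max_runs) / num_hashes)  # no bias correction applied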

Experimentally this gives fairly good results, but hashing every element many times is expensive. A better approach is something called stochastic averaging. Instead of using multiple hash functions, we use only one, but split its output: one part is used as a bucket index to assign the value to one of many buckets. Suppose we need 1024 buckets; we can take the first 10 bits of the hash as the bucket number and use the remaining bits of the hash to count leading zeroes. This loses nothing in precision but saves a great deal of hash computation.

Putting together what we have learned so far, here is a simple implementation. It is equivalent to the algorithm in Durand and Flajolet's paper, except that for convenience and clarity I count trailing 0 bits rather than leading ones; the result is exactly equivalent.

def trailing_zeroes(num):
    """Counts the number of trailing 0 bits in num."""
    if num == 0:
        return 32  # Assumes 32-bit integer inputs!
    p = 0
    while (num >> p) & 1 == 0:
        p += 1
    return p

def estimate_cardinality(values, k):
    """Estimates the number of unique elements in the input set values.

    Arguments:
        values: An iterator of hashable elements to estimate the cardinality of.
        k: The number of bits of hash to use as a bucket number; there will be 2**k buckets.
    """
    num_buckets = 2 ** k
    max_zeroes = [0] * num_buckets
    for value in values:
        h = hash(value)
        bucket = h & (num_buckets - 1)  # Mask out the k least significant bits as bucket ID
        bucket_hash = h >> k
        max_zeroes[bucket] = max(max_zeroes[bucket], trailing_zeroes(bucket_hash))
    return 2 ** (float(sum(max_zeroes)) / num_buckets) * num_buckets * 0.79402

It is pretty much what we just described: we keep an array recording the longest run of leading (or, here, trailing) zeroes seen in each bucket, then average those counts at the end; if the average is x, our estimate is 2^x multiplied by the number of buckets. What has not been mentioned yet is the magic number 0.79402. Statistical analysis shows that our procedure has a predictable bias towards larger estimates; this magic constant, derived in Durand and Flajolet's paper, corrects that bias. The value actually varies with the number of buckets used (up to a maximum of 2^64), but for larger numbers of buckets it converges to the estimate used in the algorithm above. See the full paper for much more information, including how this magic number is derived.

This procedure gives us a very good estimate: for m buckets, the average error rate is about 1.30/sqrt(m), so with 1024 buckets we can expect an average error of roughly 4%. To estimate data sets of up to 2^27 elements, 5 bits per bucket is sufficient. That is less than 1 KB of memory (1024 * 5 = 5120 bits, or 640 bytes), which is really great!

Let's test it on some random data:

>>> [100000/estimate_cardinality([random.random() for i in range(100000)], 10) for j in range(10)]
[0.9825616152548807, 0.9905752876839672, 0.979241749110407, 1.050662616357679, 0.937090578752079, 0.9878968276629505, 0.9812323203117748, 1.0456960262467019, 0.9415413413873975, 0.9608567203911741]

The results are not bad: some estimates exceed the expected 4% deviation, but overall they are good. If you try this experiment yourself, note that Python's built-in hash() function hashes integers to themselves, so running something like estimate_cardinality(range(10000), 10) gives a wildly inaccurate result, because hash() is not a good hash function in that case. Using random numbers as in the example above is, of course, no problem.
Improving accuracy: SuperLogLog and HyperLogLog

Although we already have a very good estimate, it is possible to do much better. Durand and Flajolet found that extreme values can greatly hurt the accuracy of the estimate, and that accuracy can be improved by discarding the largest values before averaging. Specifically, by discarding the 30% of buckets with the largest values and averaging only the remaining 70%, the accuracy improves from 1.30/sqrt(m) to 1.05/sqrt(m)! That means in our earlier example, with 640 bytes of state, the average error rate goes from 4% to about 3.2%, with no increase in space.
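
A hedged sketch of just the truncation step, reusing the max_zeroes array built by estimate_cardinality above; note that the real SuperLogLog algorithm derives its own correction constant, so 0.79402 is kept here only as a placeholder:

def estimate_with_truncation(max_zeroes, keep_fraction=0.7, alpha=0.79402):
    """Average only the smallest 70% of bucket values before exponentiating.
    Illustration of the 'discard the extreme buckets' idea only; the actual
    SuperLogLog paper derives a different correction constant for this."""
    num_buckets = len(max_zeroes)
    kept = sorted(max_zeroes)[:int(num_buckets * keep_fraction)]
    return 2 ** (float(sum(kept)) / len(kept)) * num_buckets * alpha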

Finally, the contribution of Flajolet et al's paper is to use a different kind of average: the harmonic mean rather than the geometric mean. By doing this, the error rate can be reduced to 1.04/sqrt(m), again without increasing the space required. The complete algorithm is somewhat more complicated, of course, because it must correct for errors at both small and large cardinalities. Interested readers should, as you have probably guessed by now, read the full paper.
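
For reference, the harmonic-mean ("raw") estimate looks roughly like the sketch below, assuming m >= 128 registers where each register holds the largest rho value (leading-zero count plus one) seen for its bucket; the small- and large-range corrections from the paper are omitted:

def hyperloglog_raw_estimate(registers):
    """Raw HyperLogLog estimate using a harmonic mean, without the
    small/large range corrections described in the paper.
    Each register is assumed to hold max(rho) for its bucket, where
    rho = (number of leading zero bits) + 1."""
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)  # approximation valid for m >= 128
    harmonic_sum = sum(2.0 ** -r for r in registers)
    return alpha * m * m / harmonic_sum
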
Parallelization

The uniform structure these schemes share makes them easy to parallelize. Multiple machines can independently run the same hash function with the same number of buckets; at the end we only need to combine the results, taking the maximum value of each bucket across all instances of the algorithm. Not only is this trivial to implement, since we need to transfer less than 1 KB of data per machine, but the result is exactly the same as if the algorithm had run on a single machine.
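
Merging per-machine bucket arrays could look like the sketch below (the function name is hypothetical), assuming every machine used the same hash function and the same number of buckets:

def merge_registers(*register_arrays):
    """Combine per-machine bucket arrays by taking the per-bucket maximum.
    The merged array is identical to what a single machine scanning all of
    the data would have produced."""
    return [max(values) for values in zip(*register_arrays)]

# e.g. combined = merge_registers(max_zeroes_a, max_zeroes_b, max_zeroes_c)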
Summary

As we have just seen, cardinality estimation algorithms make it possible to get a good estimate of the number of distinct values, typically using less than 1 KB of space. They work regardless of the kind of data, can run on multiple machines in a distributed fashion with minimal coordination and data transfer between machines, and the resulting estimates are useful for many things, such as traffic monitoring (how many distinct IPs have visited?) and database query optimization (should we sort and merge, or construct a hash table?).
