A tutorial on using Python to make a simple, naïve cardinality estimator


Suppose you have a very large dataset, far too large to fit in memory. The dataset contains duplicates, and you want to know how much of it is duplicated, but the data is not sorted, and sorting it is impractical because there is simply too much of it. How do you estimate how many distinct values the dataset contains? This is useful in many applications, such as query planning in a database: the best query plan depends not only on how much data there is in total, but also on how many distinct values it contains.

Before you read on, I encourage you to stop and think about this for a while, because the algorithm we are going to discuss today is very simple, yet remarkably creative, and it is not so easy to come up with on your own.
A simple and naïve cardinality estimator

Let's start with a simple example. Suppose someone generates a dataset in the following way:

    • Generate n evenly dispersed (uniformly distributed) random numbers
    • Arbitrarily pick some of those numbers and duplicate them an arbitrary number of times
    • Shuffle the resulting collection of numbers

How do we estimate the number of distinct numbers in the resulting dataset? Knowing that the original numbers are random and evenly dispersed, one very simple approach is to find the smallest value. If the largest possible value is m and the smallest value we observe is x, we can estimate that there are roughly m/x distinct numbers in the dataset. For example, if we scan a dataset of numbers between 0 and 1 and find that the smallest value is 0.01, we have reason to suspect that there are roughly 100 distinct numbers in the dataset; if we found an even smaller minimum, the number of distinct values is probably larger. Note that it does not matter how many times each number is repeated, which is only natural: repetitions do not affect the output of the min function.
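To make this concrete, here is a minimal sketch of the minimum-based estimator (the function name estimate_by_minimum is mine, not from any paper), assuming the values are uniform random floats in [0, 1), so that m = 1:

    import random

    def estimate_by_minimum(values):
        """Crudely estimate the number of distinct values drawn uniformly from [0, 1)."""
        return 1.0 / min(values)

    # 100 distinct random numbers, each repeated 5 times, then shuffled.
    data = [random.random() for _ in range(100)] * 5
    random.shuffle(data)
    print(estimate_by_minimum(data))  # Very noisy, but typically on the order of 100.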

The advantage of this procedure is that it is very intuitive, but it is also very imprecise. It is not hard to construct counterexamples: a dataset containing only a few distinct numbers can still happen to contain a very small one, and a dataset with many distinct numbers can have a minimum that is much larger than we would expect; in either case the estimate is badly off. Finally, few real datasets are dispersed and random enough for this to work at all. Still, this prototype of an algorithm gives us some hope that we might be able to achieve our goal; we just need a more refined algorithm.
Probabilistic counting

The first improvement comes from Flajolet and Martin's paper Probabilistic Counting Algorithms for Data Base Applications, with further refinements in Durand and Flajolet's Loglog Counting of Large Cardinalities and in Flajolet et al.'s HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. It is interesting to watch how the ideas emerge and improve from one paper to the next, but my approach here is slightly different: I will show how to build up and improve a solution from scratch, omitting some of the algorithms in the original papers. Interested readers can read all three; they contain a lot of mathematical detail that I will not discuss here.

First, Flajolet and Martin observed that, for any dataset, we can always use a good hash function so that the hashed values have whatever kind of distribution we need; in particular, they behave like evenly dispersed (pseudo-)random numbers. With this simple insight we can turn any dataset into the kind we constructed above, but that alone is far from enough.

Next, they found a much better way to estimate the number of distinct values than recording the smallest hash value. The estimator Flajolet and Martin used is the number of 0 bits at the beginning of each hash value. It is easy to see that, for random data, a run of k leading zero bits occurs on average once in every 2^k elements; so all we have to do is look for these runs and record the longest one to estimate the number of distinct elements. This is still not a great estimator, though: it can only give us estimates that are powers of 2, and, unlike the minimum-value method, it has high variance. On the other hand, the estimate takes very little space: to record runs of leading zero bits of up to 32 bits, we only need a 5-bit number.
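As a hedged illustration of this idea (the helper names are mine, and I assume 32-bit hash values), recording the longest run of leading zero bits might look like this:

    def leading_zeroes(num, bits=32):
        """Counts the number of leading 0 bits in a bits-wide integer."""
        if num == 0:
            return bits
        count = 0
        mask = 1 << (bits - 1)
        while num & mask == 0:
            count += 1
            mask >>= 1
        return count

    def crude_estimate(hashed_values, bits=32):
        """Estimates cardinality as 2**R, where R is the longest run of leading zeroes seen."""
        longest = max(leading_zeroes(h, bits) for h in hashed_values)
        return 2 ** longest

Note that the only state kept while scanning is the single number longest.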

Note: the original Flajolet-Martin paper goes on from here to describe a bitmap-based procedure that gives a more accurate estimate. I will not discuss that detail, because it is about to be superseded by the improvements that follow; interested readers can find it in the original paper.

So now we have a bit-pattern estimator that is really rather bad. How can we improve it? One straightforward idea is to use multiple independent hash functions. If each hash function produces its own random dataset, we can record the longest run of leading zero bits for each of them and average the results at the end for a more accurate estimate, as sketched below.
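A hedged sketch of this multi-hash idea, using salted hashes of the form hash((salt, value)) as a stand-in for m independent hash functions (my simplification, not how the papers construct them), and reusing the leading_zeroes helper from the previous sketch:

    def estimate_with_multiple_hashes(values, m=64, bits=32):
        """Averages the longest zero-run over m salted hashes, then returns 2**average."""
        longest = [0] * m
        for value in values:
            for salt in range(m):
                h = hash((salt, value)) & ((1 << bits) - 1)  # Truncate to a bits-wide hash.
                longest[salt] = max(longest[salt], leading_zeroes(h, bits))
        return 2 ** (float(sum(longest)) / m)

    # The inner loop makes this m times as expensive as hashing each value once,
    # which is exactly the cost that stochastic averaging (described next) removes.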

Experimentally this gives us quite a good result, but all that hashing is very expensive. A better approach is known as stochastic averaging. Instead of using multiple hash functions, we use just one, but we split its output: part of it is used as a bucket number to assign the value to one of many buckets, and the rest is used to compute the run of zero bits. If we want 1024 values, we can take the first 10 bits of the hash as the bucket number and compute the zero-bit run from the remaining hash bits. This loses nothing in accuracy, but saves us a great deal of hash computation.

Applying what we have learned so far, here is a simple implementation. It is equivalent to the algorithm in Durand and Flajolet's paper, except that, for convenience and clarity, I count trailing zero bits rather than leading ones; the result is exactly equivalent.

    def trailing_zeroes(num):
        """Counts the number of trailing 0 bits in num."""
        if num == 0:
            return 32  # Assumes 32 bit integer inputs!
        p = 0
        while (num >> p) & 1 == 0:
            p += 1
        return p

    def estimate_cardinality(values, k):
        """Estimates the number of unique elements in the input set values.

        Arguments:
            values: An iterator of hashable elements to estimate the cardinality of.
            k: The number of bits of hash to use as a bucket number; there will be 2**k buckets.
        """
        num_buckets = 2 ** k
        max_zeroes = [0] * num_buckets
        for value in values:
            h = hash(value)
            bucket = h & (num_buckets - 1)  # Mask out the k least significant bits as bucket ID
            bucket_hash = h >> k
            max_zeroes[bucket] = max(max_zeroes[bucket], trailing_zeroes(bucket_hash))
        return 2 ** (float(sum(max_zeroes)) / num_buckets) * num_buckets * 0.79402

It's as beautiful as described: we keep an array of counts of leading (or, here, trailing) zero bits, average them at the end, and if the average is x, our estimate is 2^x multiplied by the number of buckets. What we haven't mentioned yet is the magic number 0.79402. Statistical analysis shows that our procedure has a predictable bias towards larger estimates; the magic constant derived in Durand and Flajolet's paper corrects for this bias. The exact figure actually varies with the number of buckets used (up to a maximum of 2^64), but for larger numbers of buckets it converges to the value used in the algorithm above. See the complete paper for much more information, including how this number is derived.

This program gives us a very good estimate: for m buckets, the average error rate is about 1.3/sqrt(m). So with 1024 buckets (2^10), we can expect an average error of around 4%, and each bucket needs only 5 bits to handle datasets of up to 2^27 elements. That is less than 1 KB of memory, which is really quite awesome (1024 × 5 = 5120 bits, or 640 bytes)!

Let's test it on some random data:

>>> [100000/estimate_cardinality([random.random() for i in range(100000)], 10) for j in range(10)]
[0.9825616152548807, 0.9905752876839672, 0.979241749110407, 1.050662616357679, 0.937090578752079, 0.9878968276629505, 0.9812323203117748, 1.0456960262467019, 0.9415413413873975, 0.9608567203911741]

Not bad at all: some of the estimates are off by more than the predicted 4%, but overall the results are good. If you try this experiment yourself, one caveat: Python's built-in hash() hashes integers to themselves, so running something like estimate_cardinality(range(10000), 10) gives wildly divergent results, because in that case hash() is not a good hash function. Using random numbers as in the example above is fine, of course.
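If you do want to run it over something like range(10000), one workaround (a sketch of mine, not part of the original algorithm) is to derive a better-mixed 32-bit hash from hashlib and substitute it for hash():

    import hashlib

    def good_hash(value):
        """A well-mixed 32-bit hash derived from MD5, for demonstration purposes only."""
        return int(hashlib.md5(str(value).encode('utf-8')).hexdigest()[:8], 16)

    def estimate_cardinality_mixed(values, k):
        """Same as estimate_cardinality, but hashes values with good_hash instead of hash()."""
        num_buckets = 2 ** k
        max_zeroes = [0] * num_buckets
        for value in values:
            h = good_hash(value)
            bucket = h & (num_buckets - 1)
            max_zeroes[bucket] = max(max_zeroes[bucket], trailing_zeroes(h >> k))
        return 2 ** (float(sum(max_zeroes)) / num_buckets) * num_buckets * 0.79402

    # estimate_cardinality_mixed(range(10000), 10) should now land near 10000.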
Improved accuracy: SuperLogLog and HyperLogLog

Although we already have a very good estimate, it is possible to do much better. Durand and Flajolet found that outliers greatly hurt the accuracy of the estimate, and that accuracy can be improved by discarding the largest values before averaging. Specifically, by discarding the 30% of buckets with the largest values and averaging only the remaining 70%, accuracy improves from 1.30/sqrt(m) to 1.05/sqrt(m)! That means that in our earlier example, with 640 bytes of state, the average error goes from about 4% to about 3.2%, with no increase in the space used.
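As a rough sketch of what that modification looks like (the 0.79402 constant below is left unchanged purely for illustration; the real SuperLogLog derivation uses a different correction constant for the truncated mean):

    def estimate_cardinality_truncated(values, k, keep_fraction=0.7):
        """LogLog-style estimate that averages only the smallest 70% of bucket values."""
        num_buckets = 2 ** k
        max_zeroes = [0] * num_buckets
        for value in values:
            h = hash(value)
            bucket = h & (num_buckets - 1)
            max_zeroes[bucket] = max(max_zeroes[bucket], trailing_zeroes(h >> k))
        kept = sorted(max_zeroes)[:int(keep_fraction * num_buckets)]  # Drop the largest 30%.
        return 2 ** (float(sum(kept)) / len(kept)) * num_buckets * 0.79402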

Finally, the contribution of Flajolet et al.'s HyperLogLog paper is to use a different kind of average: a harmonic mean rather than the geometric mean. By doing so, they reduce the error to 1.04/sqrt(m), again without increasing the space required. The complete algorithm is somewhat more complicated, of course, because it has to correct for errors at both small and large cardinalities. Interested readers will, as you have probably guessed by now, want to read the complete paper.
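A similarly hedged sketch of the harmonic-mean (HyperLogLog-style) raw estimate; the bias constant alpha below is the usual approximation 0.7213 / (1 + 1.079 / m), and the small- and large-range corrections from the paper are omitted:

    def estimate_cardinality_harmonic(values, k):
        """Raw HyperLogLog-style estimate using a harmonic mean of the bucket values."""
        num_buckets = 2 ** k
        ranks = [0] * num_buckets
        for value in values:
            h = hash(value)
            bucket = h & (num_buckets - 1)
            # The register stores the position of the lowest set bit (zero run + 1).
            ranks[bucket] = max(ranks[bucket], trailing_zeroes(h >> k) + 1)
        alpha = 0.7213 / (1 + 1.079 / num_buckets)
        return alpha * num_buckets ** 2 / sum(2.0 ** -r for r in ranks)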
Parallelization

One of the nice things about these schemes is that they are very easy to parallelize. Multiple machines can independently run the algorithm with the same hash function and the same number of buckets; at the end, we only need to combine the results by taking, for each bucket, the maximum value across all instances of the algorithm. Not only is this trivial to implement, since we need to transfer less than 1 KB of data per machine, but the result is exactly the same as if the algorithm had been run over the whole dataset on a single machine.
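A hedged sketch of the merge step, assuming each machine was run with the same k and the same hash function and returns its max_zeroes array:

    def merge_buckets(bucket_arrays):
        """Combines per-machine bucket arrays by taking the element-wise maximum."""
        merged = list(bucket_arrays[0])
        for buckets in bucket_arrays[1:]:
            for i, zeroes in enumerate(buckets):
                merged[i] = max(merged[i], zeroes)
        return merged

    # The merged array can then be plugged into the same final formula:
    # 2 ** (float(sum(merged)) / num_buckets) * num_buckets * 0.79402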
Summary

Cardinality estimation algorithms like the ones we have just discussed make it possible to get a very good estimate of the number of distinct values in a dataset, usually using less than 1 KB of space. We can use them regardless of the kind of data, the work can be distributed over multiple machines, and the coordination and data transfer between machines is minimal. The resulting estimates are useful for many things, such as traffic monitoring (how many distinct IPs are accessing us?) and database query optimization (should we sort and merge, or construct a hash table?).
