Create a simple cardinality estimator using Python


Suppose you have a very large dataset, far too large to hold in memory, and it contains duplicate entries. You want to know how many duplicates there are, but the data is not sorted, and at this volume sorting is impractical. How do you estimate the number of distinct values the dataset contains? This is useful in many applications. For example, in database query planning: the best query plan depends not only on the total amount of data, but also on how many distinct values it contains.

Before you read on, I encourage you to stop and think about the problem yourself, because the algorithm we discuss today is very simple yet remarkably creative, and it is not at all easy to come up with.
A simple cardinality estimator

Let's start with a simple example. Suppose someone generates a dataset as follows:

  • Generate n evenly distributed random numbers
  • Randomly pick some of them and repeat them an arbitrary number of times
  • Shuffle the resulting numbers

How can we estimate the number of distinct values in the resulting dataset? We know the original values are random and evenly distributed. A very simple approach is to find the smallest value. If the maximum possible value is m and the smallest value we see is x, we can estimate that the dataset contains roughly m/x distinct values. For example, if we scan a dataset of numbers between 0 and 1 and find that the smallest value is 0.01, it is reasonable to guess that there are roughly 100 distinct values; if we had found an even smaller minimum, the dataset probably contains more distinct values. Note that it does not matter how many times each value is repeated, since repetition does not change the value of the minimum.
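To make the idea concrete, here is a minimal sketch of this minimum-value estimator. The function name and the synthetic test data are my own, for illustration only; it assumes the values are uniform random numbers in [0, 1), so the maximum possible value m is 1.

import random

def estimate_by_minimum(values):
    """Estimate the distinct count as m/x, where m is the maximum possible
    value (1.0 here) and x is the smallest value seen."""
    return 1.0 / min(values)

# Example: roughly 1000 distinct uniform values, each appearing three times.
distinct = [random.random() for _ in range(1000)]
data = distinct * 3          # duplication does not change the minimum
random.shuffle(data)
print(estimate_by_minimum(data))   # very rough; often off by a large factor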

The appeal of this procedure is that it is extremely intuitive, but it is also very inaccurate. It is not hard to come up with counterexamples: a dataset with only a few distinct values might happen to contain a very small number, and a dataset with many distinct values might have a larger minimum than we expect; in either case the estimate will be far off. Finally, few real datasets are evenly distributed and sufficiently random. Still, this prototype gives us some inspiration that our goal might be achievable; we just need more refined algorithms.
Probabilistic counting

The first set of refinements comes from the paper Probabilistic Counting Algorithms for Data Base Applications by Flajolet and Martin, with further improvements in LogLog Counting of Large Cardinalities by Durand and Flajolet and HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm by Flajolet et al. It is interesting to watch the ideas develop and improve from one paper to the next, but my approach here is slightly different: I will demonstrate how to build and improve a solution from scratch, omitting some of the algorithms in the original papers. Interested readers are encouraged to read all three papers, which contain a great deal of mathematical detail that I will not discuss here.

First, Flajolet and Martin observed that, given a good hash function, we can take any dataset and turn it into the kind of dataset we need: one whose values look like evenly distributed (pseudo-)random numbers. With this simple insight we can convert arbitrary data into the kind of dataset we produced above, but that alone is far from enough.

Next, they observed that there are better ways to estimate the number of distinct values than recording the minimum hash value. The estimator Flajolet and Martin used is the number of leading 0 bits in the hash values. Clearly, in a random dataset, on average one in every 2^k elements has a hash whose leading k bits are all zero. All we need to do is look for these runs of zeros and record the longest one in order to estimate the number of distinct elements. However, this is still not a great estimator: it can only give us estimates that are powers of two, and, unlike the minimum-based method, it has high variance. On the other hand, it needs very little space: to record the longest run of leading zeros in 32-bit hashes, we only need a 5-bit number.
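As a concrete illustration of this idea (a sketch of my own, not code from the papers), the following counts leading zeros of 32-bit hash values and returns the power-of-two estimate described above:

def leading_zeroes(num, bits=32):
    """Count the leading 0 bits of a bits-wide unsigned integer."""
    count = 0
    for i in range(bits - 1, -1, -1):
        if (num >> i) & 1:
            break
        count += 1
    return count

def estimate_by_leading_zeroes(values):
    max_run = 0
    for value in values:
        h = hash(value) & 0xFFFFFFFF          # reduce to a 32-bit value
        max_run = max(max_run, leading_zeroes(h))
    return 2 ** max_run                        # power-of-two estimate, high variance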

Note: The original Flajolet-Martin paper goes on to use a bitmap-based procedure to obtain a more accurate estimate. I will not discuss those details, because the method is immediately improved upon in what follows; see the original paper for more.

So now we have a rather poor estimator. How can we improve it? One straightforward idea is to use multiple independent hash functions. If each hash function produces its own random dataset, we can record the longest run of leading zeros for each, and at the end average the results to get a more accurate estimate.

Experimentally this works very well, but hashing many times is expensive. A better approach is known as stochastic averaging: instead of multiple hash functions, we use just one, but split its output, using part of it as a bucket number to place the value into one of many buckets. Suppose we want 1024 buckets: we can use the first 10 bits of the hash as the bucket number, and then use the remaining bits to count leading zeros. This loses no accuracy, but saves a great deal of hash computation.

Applying what we have learned so far, here is a simple implementation. It is equivalent to the algorithm in the Durand-Flajolet paper; for convenience and clarity I count trailing (rather than leading) zero bits, which is entirely equivalent.
 

def trailing_zeroes(num):
    """Counts the number of trailing 0 bits in num."""
    if num == 0:
        return 32  # Assumes 32 bit integer inputs!
    p = 0
    while (num >> p) & 1 == 0:
        p += 1
    return p

def estimate_cardinality(values, k):
    """Estimates the number of unique elements in the input set values.

    Arguments:
      values: An iterator of hashable elements to estimate the cardinality of.
      k: The number of bits of hash to use as a bucket number; there will be 2**k buckets.
    """
    num_buckets = 2 ** k
    max_zeroes = [0] * num_buckets
    for value in values:
        h = hash(value)
        bucket = h & (num_buckets - 1)  # Mask out the k least significant bits as bucket ID
        bucket_hash = h >> k
        max_zeroes[bucket] = max(max_zeroes[bucket], trailing_zeroes(bucket_hash))
    return 2 ** (float(sum(max_zeroes)) / num_buckets) * num_buckets * 0.79402

This is pretty much as we described: we keep an array recording the longest run of leading (or, here, trailing) zeros per bucket, and at the end we average those counts. If the average is x, our estimate is 2^x multiplied by the number of buckets. What I have not yet explained is the magic number 0.79402. Statistics show that our procedure has a predictable bias towards larger estimates; this magic constant, derived in the Durand-Flajolet paper, corrects that bias. In reality the constant varies with the number of buckets used, but for larger bucket counts it converges to the value used in the algorithm above. For full details, including how the magic number is derived, see the paper.

This procedure gives us a very good estimate: for m buckets, the average error is about 1.3/sqrt(m). So with 1024 buckets (2^10) we can expect an average error of about 4%, and 5 bits per bucket is enough to estimate datasets with up to 2^27 entries. That is 1024 * 5 = 5120 bits, or 640 bytes of state, which is really good for well under 1 KB of memory!
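For what it's worth, the arithmetic behind these figures is easy to check (the variable names below are just for illustration):

m = 1024
expected_error = 1.3 / m ** 0.5    # about 0.0406, i.e. roughly 4%
state_bits = m * 5                 # 5120 bits
state_bytes = state_bits // 8      # 640 bytes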

Let's test it on some random data:
 

>>> import random
>>> [100000/estimate_cardinality([random.random() for i in range(100000)], 10) for j in range(10)]
[0.9825616152548807, 0.9905752876839672, 0.979241749110407, 1.050662616357679, 0.937090578752079, 0.9878968276629505, 0.9812323203117748, 1.0456960262467019, 0.9415413413873975, 0.9608567203911741]

Not bad: some estimates are off by more than the expected 4%, but overall the results are good. If you try this experiment yourself, note that Python's built-in hash() function hashes integers to themselves. As a result, running something like estimate_cardinality(range(10000), 10) gives wildly inaccurate results, because hash() is not a good hash function in that case. Using random numbers as in the example above works correctly, of course.
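If you do want to feed integers (or other poorly mixed keys) to the estimator, one possible workaround, shown here only as an illustrative sketch, is to run the values through a general-purpose hash such as MD5 from Python's hashlib and use 32 bits of the digest in place of hash():

import hashlib

def mixed_hash(value):
    """Derive a well-mixed 32-bit value from MD5 (an illustrative stand-in hash)."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

# Use mixed_hash(value) in place of hash(value) inside estimate_cardinality,
# e.g. when estimating the cardinality of range(10000).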
Improved accuracy: SuperLogLog and HyperLogLog

Although we already have a very good estimate, it can be better. Durand and Flajolet found that extreme outliers greatly hurt the accuracy of the result, and that accuracy can be improved by discarding the largest values before averaging. Specifically, by discarding the 30% of buckets with the largest values and averaging only the remaining 70%, the accuracy improves from 1.30/sqrt(m) to 1.05/sqrt(m)! That means that in our earlier example, with 640 bytes of state, the average error drops from about 4% to about 3.2%, with no increase in space.
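As a rough sketch of that truncated-mean refinement (reusing trailing_zeroes from above), one could modify the earlier estimator as follows. Note that the bias constant really ought to be re-derived for this variant; 0.79402 is kept only to mirror the earlier code, so treat the output as illustrative rather than as the exact SuperLogLog algorithm.

def estimate_cardinality_truncated(values, k):
    num_buckets = 2 ** k
    max_zeroes = [0] * num_buckets
    for value in values:
        h = hash(value)
        bucket = h & (num_buckets - 1)
        max_zeroes[bucket] = max(max_zeroes[bucket], trailing_zeroes(h >> k))
    kept = sorted(max_zeroes)[:int(num_buckets * 0.7)]   # drop the 30% largest buckets
    return 2 ** (float(sum(kept)) / len(kept)) * num_buckets * 0.79402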

Finally, the contribution of the Flajolet et al paper is to use a different kind of average: the harmonic mean instead of the geometric mean. By doing so, the error rate can be reduced to 1.04/sqrt(m), again without increasing the required space. The complete algorithm is somewhat more complex, of course, because it must correct for errors at both small and large cardinalities. Interested readers should, as you may have guessed by now, read the full paper.
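For the curious, here is a simplified sketch of a harmonic-mean (HyperLogLog-style) raw estimate, again reusing trailing_zeroes from above. It stores, per bucket, the position of the first 1 bit (zero run plus one) and omits the small- and large-range corrections described in the paper, so expect poor results at very small or very large cardinalities; the constant 0.7213/(1 + 1.079/m) is the usual large-m approximation of the bias correction.

def estimate_cardinality_hll(values, k):
    m = 2 ** k
    registers = [0] * m
    for value in values:
        h = hash(value)
        bucket = h & (m - 1)
        rank = trailing_zeroes(h >> k) + 1      # position of the first 1 bit
        registers[bucket] = max(registers[bucket], rank)
    alpha = 0.7213 / (1 + 1.079 / m)            # bias constant for large m
    return alpha * m * m / sum(2.0 ** -r for r in registers)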
Parallelization

One nice property these schemes share is that they are very easy to parallelize. Multiple machines can independently run the algorithm with the same hash function and the same number of buckets; at the end we simply combine the results by taking, for each bucket, the maximum value recorded by any instance of the algorithm. Not only is this trivial to implement, since we need to transfer less than 1 KB of data per machine, but the result is exactly the same as if the algorithm had run on a single machine.
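A merge step might look like the following sketch, where the bucket arrays (the names are hypothetical) come from independent runs of estimate_cardinality's inner loop on different machines:

def merge_buckets(bucket_arrays):
    """Combine per-machine max-zero arrays (all the same length) by taking per-bucket maxima."""
    return [max(column) for column in zip(*bucket_arrays)]

# merged = merge_buckets([buckets_machine_a, buckets_machine_b])
# The estimate computed from `merged` equals the single-machine result.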
Summary

Cardinality estimation algorithms like the ones we have just discussed make it possible to get a good estimate of the number of distinct values, usually using only about 1 KB of state. They work regardless of the type of data and can run on multiple machines in a distributed fashion, with minimal coordination and data transfer between machines. The resulting estimates are useful for many things, such as traffic monitoring (how many distinct IP addresses has a host seen?) and database query optimization (should we sort and merge, or build a hash table?).
