To better understand the challenge of determining the cardinality of a large dataset, imagine that your log file contains 16-character IDs and you want to count the number of distinct IDs. For example:
4f67bfc603106cb2
These 16 characters need 128 bits to represent, so 65,000 IDs require about 1MB of space. We receive more than 3 billion event records every day, each with an ID; those IDs alone require 384 billion bits, or about 45GB of storage. And that is just the space the ID field needs. The easiest way to count the distinct IDs in the daily event records would be to keep a hash set in memory, where the hash set contains the list of unique IDs (that is, even if an ID appears in many input records, it is stored only once in the set). Even if we assume that only one in three record IDs is unique (that is, two thirds are duplicates), the hash set would still require 119GB of RAM, not counting the overhead Java needs to store objects in memory. You would need a machine with hundreds of gigabytes of RAM just to count the distinct IDs in a single day's log events, and if we want to count data over weeks or months, the problem only gets harder. We certainly don't have a spare machine with hundreds of gigabytes of RAM lying around, so we need a better solution.
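For illustration, a minimal sketch of this naive hash-set approach might look like the following (the events.log path and the one-ID-per-line format are hypothetical):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class NaiveDistinctCount {
    public static void main(String[] args) throws IOException {
        // Every unique ID stays resident in memory -- this is what grows to
        // ~119GB of RAM at billions of events per day.
        Set<String> uniqueIds = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("events.log"))) {
            String id;
            while ((id = reader.readLine()) != null) {
                uniqueIds.add(id); // duplicate IDs are ignored by the set
            }
        }
        System.out.println("Distinct IDs: " + uniqueIds.size());
    }
}
```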
A common way to approach this problem is to use a bit map (see the blog post "A massive data processing algorithm: Bit-map"). Bit maps can quickly and accurately determine the cardinality of a given input. The basic idea is to map the dataset into a bit field using a hash function, so that each input element corresponds uniquely to one bit. This mapping produces no collisions and reduces the space needed to count each distinct element to a single bit. Although bit maps drastically reduce the space requirements, they still run into trouble when the cardinality is very high or when there are very many different sets to count. For example, to count distinct values among one billion possible values with a bit map, you would need one billion bits, or roughly 120MB per counter. Sparse bit maps can be compressed to gain space efficiency, but they are not always helpful.
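As a rough sketch of the idea, assuming (purely for illustration) that each ID can be mapped into a known integer range, java.util.BitSet gives an exact count at one bit per possible value:

```java
import java.util.BitSet;

public class BitMapCounter {
    // One bit per possible value: a range of one billion IDs needs roughly 120MB.
    private final BitSet bits;

    public BitMapCounter(int maxId) {
        this.bits = new BitSet(maxId);
    }

    public void offer(int id) {
        bits.set(id); // a duplicate simply re-sets a bit that is already 1
    }

    public int cardinality() {
        return bits.cardinality(); // number of 1 bits = exact distinct count
    }
}
```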
Fortunately, cardinality estimation is an active area of research. We have drawn on that research to provide open source implementations of cardinality estimation, set membership detection, and top-k algorithms.
Cardinality estimation algorithms trade accuracy for space. To illustrate this, we counted the number of distinct words in all of Shakespeare's works using three different approaches. Note that our input dataset contains some extra data, so the cardinality is higher than the standard reference answer to this question. The three techniques are: a Java HashSet, a Linear Probabilistic Counter, and a HyperLogLog counter. The results are as follows:
The table shows that the probabilistic counters count these words using only a few hundred bytes, with an error of less than 3%. By contrast, the HashSet gives a perfectly accurate count but needs nearly 10MB of space, so it is easy to see why cardinality estimators are useful. In practice, accuracy is often not critical, which is the case for most web-scale and network counting problems, and a probabilistic counter yields enormous space savings.
Linear Probabilistic Counter
The Linear Probabilistic Counter is space efficient and allows the implementer to specify the desired level of accuracy. This algorithm is useful when space efficiency matters but you also need to be able to control the error in the result. It runs in two steps. In the first step, a bit map initialized to all zeros is allocated in memory; a hash function is then applied to every entry in the input data, and the result of the hash maps each record (or element) to a bit in the bit map, which is set to 1. In the second step, the algorithm counts the number of bits that are still empty and plugs that number into the following formula to obtain the estimate:
n = -m * ln(Vn)
In the formula, m is the size of the bit map and Vn is the ratio of empty bits to the size of the map. The important thing to note is that the original bit map can be much smaller than the expected maximum cardinality; how much smaller depends on how much error you can tolerate. Because the size m of the bit map is smaller than the total number of distinct elements, collisions will occur. Those collisions are what save space, but they also introduce error into the estimate. So by controlling the size of the original map, we can estimate the number of collisions, and therefore how much error to expect in the final result.
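A minimal sketch of the two steps described above might look like this (the bit map size and the use of Java's built-in hashCode are illustrative choices, not the stream-lib implementation):

```java
import java.util.BitSet;

public class LinearProbabilisticCounter {
    private final int m;      // size of the bit map: smaller m = less space but more error
    private final BitSet bits;

    public LinearProbabilisticCounter(int m) {
        this.m = m;
        this.bits = new BitSet(m);
    }

    // Step 1: hash each element to one of the m bits and set that bit to 1
    public void offer(String element) {
        int bit = (element.hashCode() & Integer.MAX_VALUE) % m;
        bits.set(bit);
    }

    // Step 2: n = -m * ln(Vn), where Vn is the fraction of bits still 0
    public long cardinality() {
        int empty = m - bits.cardinality();
        if (empty == 0) {
            throw new IllegalStateException("bit map saturated: m is too small for this input");
        }
        double vn = (double) empty / m;
        return Math.round(-m * Math.log(vn));
    }
}
```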
HyperLogLog
As its name suggests, the HyperLogLog counter estimates the cardinality of a dataset with cardinality Nmax using only loglog(Nmax) + O(1) bits. Like the linear counter, a HyperLogLog counter lets the designer specify the desired accuracy; in HyperLogLog's case this is done by defining the desired relative standard deviation and the maximum cardinality expected to be counted. Most counters work by taking an input stream M and applying a hash function h(M) to it, which yields an observable S = h(M) over strings in {0,1}^∞. HyperLogLog extends this idea by splitting the hashed input into m substreams and maintaining one observable for each substream (each substream is effectively its own small estimator). Taking a suitable average of these additional observables produces a counter whose precision improves as m grows, while requiring only a constant number of operations per element of the input set. The result is that this counter can count one billion distinct elements with 2% accuracy using only 1.5KB of space, compared with the 120 megabytes required by the HashSet approach.
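Below is a compact, simplified sketch of that scheme: hash each element, use the first few bits of the hash to pick a substream, keep the longest run of leading zeros seen in each substream, and combine the m observables with a bias-corrected harmonic mean. The register count, the FNV-1a hash, and the correction constants are illustrative choices; a production implementation such as stream-lib's adds further refinements.

```java
import java.nio.charset.StandardCharsets;

public class MiniHyperLogLog {
    private final int b;          // bits used to pick a substream; m = 2^b substreams
    private final int m;
    private final int[] registers;

    public MiniHyperLogLog(int b) {
        this.b = b;
        this.m = 1 << b;
        this.registers = new int[m];
    }

    public void offer(String item) {
        long x = fnv1a64(item);
        int idx = (int) (x >>> (64 - b));                  // first b bits choose the register
        long rest = x << b;                                // remaining bits are the observable
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - b + 1);
        registers[idx] = Math.max(registers[idx], rank);   // keep the maximum per substream
    }

    public double cardinality() {
        double alpha = 0.7213 / (1 + 1.079 / m);           // bias correction (approx., m >= 128)
        double sum = 0;
        int zeroRegisters = 0;
        for (int r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) zeroRegisters++;
        }
        double estimate = alpha * m * m / sum;             // harmonic mean of the m observables
        if (estimate <= 2.5 * m && zeroRegisters > 0) {
            // small-range correction: fall back to linear counting
            estimate = m * Math.log((double) m / zeroRegisters);
        }
        return estimate;
    }

    // FNV-1a is used here only for brevity; a stronger hash (e.g. MurmurHash) is preferable.
    private static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;
        for (byte by : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (by & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    public static void main(String[] args) {
        MiniHyperLogLog hll = new MiniHyperLogLog(10);     // 1024 registers, roughly 1KB of state
        for (int i = 0; i < 2_000_000; i++) {
            hll.offer("id-" + (i % 500_000));              // only 500,000 distinct values
        }
        System.out.printf("estimate: %.0f%n", hll.cardinality());
    }
}
```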
Merging distributed counters
We have shown that we can estimate the cardinality of a large set using the counters described above. But what do you do if the original input dataset does not fit on a single machine? That is exactly the problem we face at Clearspring: our data is spread across hundreds of servers, and each server holds only a partial subset of the overall dataset. Being able to combine the contents of a set of distributed counters is therefore essential. The idea sounds a bit puzzling at first, but if you spend some time thinking about it, you will find it is not much different from basic cardinality estimation. Because these counters represent a set as bits in a map, we can take two compatible counters and merge them by combining their bits into a single map. The algorithm already handles collisions, so we still get a cardinality estimate within the required precision, even though we never brought all of the input data onto one machine. This is extremely useful and saves us a great deal of time and effort that would otherwise be spent moving data around the network.
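As a concrete illustration, using the hypothetical linear-counter sketch from earlier, merging two counters reduces to a bitwise OR of their bit maps, provided both were built with the same size m and the same hash function (HyperLogLog counters merge analogously, by taking the per-register maximum):

```java
import java.util.BitSet;

public class MergeExample {
    // OR together two compatible linear-counter bit maps; estimating on the merged
    // map yields the cardinality of the combined input, as if a single counter had
    // seen all of the data on one machine.
    public static BitSet merge(BitSet fromServerA, BitSet fromServerB) {
        BitSet merged = (BitSet) fromServerA.clone();
        merged.or(fromServerB);
        return merged;
    }
}
```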
Next Steps
Hopefully this article has helped you better understand probabilistic counters and how they can be applied. If estimating the cardinality of large sets is a problem you face, and you happen to use a JVM-based language, you should check out the stream-lib project: it provides implementations of the algorithms described above as well as several other stream-processing utilities.
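For example, a minimal use of stream-lib's HyperLogLog might look like the following; the class and method names reflect the library's com.clearspring.analytics.stream.cardinality package as I understand it, so check the project's README for the current API and Maven coordinates:

```java
import com.clearspring.analytics.stream.cardinality.HyperLogLog;

public class StreamLibExample {
    public static void main(String[] args) {
        // log2m = 10 gives 2^10 registers; a larger value trades memory for accuracy.
        HyperLogLog hll = new HyperLogLog(10);
        for (int i = 0; i < 1_000_000; i++) {
            hll.offer("id-" + (i % 250_000)); // only 250,000 distinct values
        }
        System.out.println("Estimated cardinality: " + hll.cardinality());
    }
}
```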