At Clearspring, we work on data statistics. Counting the number of distinct elements in a data set is a challenge when the set is large.
To better understand the challenge of determining the cardinality of a large dataset, suppose your log files contain 16-character IDs and you want to count the number of distinct IDs. For example:
4f67bfc603106cb2
These 16 characters take 128 bits to represent, so 65,000 IDs require about 1 MB of space. We receive more than 3 billion event records every day, each with an ID. Those IDs require 384 billion bits, or about 45 GB of storage, and that is just the space required for the ID field. The simplest way to get the cardinality of the IDs in a day's event records is to build an in-memory hash set containing the unique IDs (the same ID may appear in many records of the input files, but it appears only once in the hash set). Even if we assume that only one in three record IDs is unique (that is, two thirds are duplicates), the hash set still requires 119 GB of RAM, not counting the overhead Java needs to store objects in memory. You would need a machine with several hundred GB of memory to count the distinct elements, and that covers only the unique IDs in a single day of log events. The problem only gets harder if we want to count weeks or months of data. We certainly don't have a spare machine with several hundred GB of free RAM, so we needed a better solution.
A common way to solve this problem is to use bitmaps (see the blog post "mass data processing algorithm-bit-map"). Bitmaps can quickly and accurately compute the cardinality of a given input. The basic idea is to use a hash function to map the input dataset onto a bit field, with each input element mapping uniquely to one bit. This produces no collisions and reduces the space needed to count each unique element to 1 bit. Although bitmaps save a great deal of storage compared with the hash set, they are still problematic when the cardinality is very high or when there are very many different sets to count. For example, counting to one billion with a bitmap requires one billion bits, or roughly 120 MB for each counter. Sparse bitmaps can be compressed for better space efficiency, but that does not always help.
Fortunately, cardinality estimation is an active research area. We have drawn on this research to provide an open source implementation of cardinality estimators, set membership detection, and top-k algorithms.
Cardinality estimation algorithms give up some accuracy in exchange for space. To illustrate this, we counted the number of distinct words in all of Shakespeare's works using three different techniques. Note that our input dataset contains extra data, so the cardinality is higher than the standard reference answer to this question. The three techniques are a Java HashSet, a Linear Probabilistic Counter, and a HyperLogLog counter. The results are as follows:
With HyperLogLog we counted these words using only 512 bytes, with an error of about 3%. By contrast, the HashSet gives an exact count but requires nearly 10 MB of space, so it is easy to see why cardinality estimation is useful. In applications where accuracy is not paramount, which is true of most web-scale and network counting scenarios, a probabilistic counter can save an enormous amount of space.
Linear Probabilistic Counter
The Linear Probabilistic Counter is space efficient and lets the implementer specify the desired level of accuracy. It is useful when space efficiency matters but you need to be able to control the error in the results. The algorithm runs in two steps. The first step allocates a bitmap in memory, initialized to all zeros, and then applies a hash function to each entry in the input data; the hash result maps the entry to one bit of the bitmap, and that bit is set to 1. In the second step, the algorithm counts the number of empty bits and plugs that number into the following formula to get the estimate:
$$n = -m \ln V_n$$
Note: $\ln V_n = \log_e V_n$ is the natural logarithm.
In the formula, $m$ is the size of the bitmap and $V_n$ is the ratio of empty bits to the size of the map. The important thing to note is that the original bitmap can be much smaller than the expected maximum cardinality. How much smaller depends on how much error you can tolerate. Because the bitmap size $m$ is smaller than the total number of distinct elements, collisions will occur. These collisions are what make the counter space efficient, but they also cause the error in the estimate. So by controlling the size of the original map, we can estimate the number of collisions and therefore how much error the final result will contain.
HyperLogLog
As the name implies, the HyperLogLog counter can estimate the cardinality of a set whose cardinality is at most Nmax using only log log(Nmax) + O(1) bits. Like the linear counter, the HyperLogLog counter lets the designer specify the desired accuracy, which in HyperLogLog's case is done by defining the desired relative standard deviation and the maximum cardinality you expect to count. Most counters work by taking an input data stream M and applying a hash function h to it, which yields an observable S = h(M) of {0,1}^∞ strings. HyperLogLog extends this concept by splitting the hashed input stream into m substreams and maintaining one observable for each substream. Averaging these additional observables yields a counter whose accuracy improves as m grows, while requiring only a constant number of operations per element of the input set. As a result, this counter can count one billion distinct elements with an accuracy of 2% using only 1.5 KB of space. Compared with the 120 MB required by the HashSet implementation, the efficiency of this algorithm is obvious.
Merging Distributed Counters
We have shown that the counters described above can estimate the cardinality of large sets. But what if your raw input dataset does not fit on a single machine? This is exactly the problem we face at Clearspring. Our data is spread across hundreds of servers, and each server holds only a partial subset of the total dataset. This is where the ability to merge the contents of a set of distributed counters becomes crucial. The idea is a little mind-bending, but if you take a moment to think about it, it is not much different from basic cardinality estimation. Because the counters represent the cardinality as a set of bits in a map, we can take two compatible counters and merge their bits into a single map. The algorithms already handle collisions, so we still get a cardinality estimate with the desired precision even though we never brought all the input data onto a single machine. This is very useful and saves us a lot of time and effort moving data over the network.
Next Steps
Hopefully this article has helped you better understand the concept and application of probabilistic counters. If estimating the cardinality of large sets is a problem for you and you happen to use a JVM-based language, you should check out the stream-lib project. It provides implementations of the algorithms described above as well as several other stream-processing utilities.
This article is from: High Scalability
Translator's note: my English is admittedly not good, so this translation may contain discrepancies. Having read the translation on CSDN, though, I doubt it was done by a technical person; some passages appear to have been machine-translated directly.
For an in-depth understanding of HyperLogLog, see http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf and http://www.ic.unicamp.br/~celio/peer2peer/math/bitmap-algorithms/durand03loglog.pdf
At Clearspring we like to count things. Counting the number of distinct elements (the cardinality) of a set is a challenge when the cardinality of the set is large.
To better understand the challenge of determining the cardinality of large sets, let's imagine that you have a 16 character ID and you'd like to count the number of distinct IDs that you've seen in your logs. Here is an example:
4f67bfc603106cb2
These 16 characters represent 128 bits. 65K IDs would require 1 megabyte of space. We receive over 3 billion events per day, and each event has an ID. Those IDs require 384,000,000,000 bits or 45 gigabytes of storage. And that's just the space that the ID field requires! To get the cardinality of IDs in our daily events we could take a simplistic approach. The most straightforward idea is to use an in-memory hash set that contains the unique list of IDs seen in the input files. Even if we assume that only 1 in 3 records is unique, the hash set would still take 119 gigs of RAM, not including the overhead Java requires to store objects in memory. You would need a machine with several hundred gigs of memory to count the distinct elements in a single day's worth of unique IDs. The problem only gets more difficult if we want to count weeks or months of data. We certainly don't have a single machine with several hundred gigs of free memory sitting around, so we needed a better solution.
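The storage figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the constant names and the helper `id_storage_bits` are made up for illustration, "over 3 billion" is rounded down to exactly 3 billion, and the gigabyte figure is computed in binary units (GiB), which appears to be what the 45 gigabyte figure assumes.

```python
# Back-of-the-envelope storage arithmetic for the ID field alone.
BITS_PER_ID = 16 * 8             # 16 characters at 8 bits each = 128 bits
EVENTS_PER_DAY = 3_000_000_000   # "over 3 billion", rounded down (assumption)


def id_storage_bits(num_ids: int, bits_per_id: int = BITS_PER_ID) -> int:
    """Bits needed to store num_ids raw IDs with no container overhead."""
    return num_ids * bits_per_id


total_bits = id_storage_bits(EVENTS_PER_DAY)
total_gib = total_bits / 8 / 2**30           # bits -> bytes -> GiB
ids_per_mib = (2**20 * 8) // BITS_PER_ID     # how many IDs fit in 1 MiB

print(f"{total_bits:,} bits per day")
print(f"{total_gib:.1f} GiB per day")
print(f"{ids_per_mib:,} IDs per MiB")
```

Under these assumptions, one day of raw IDs comes to 384 billion bits, and 65,536 IDs (the "65K" above) fill one mebibyte.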
One common approach to this problem is the use of bitmaps. Bitmaps can be used to quickly and accurately get the cardinality of a given input. The basic idea with a bitmap is mapping the input dataset to a bit field using a hash function where each input element uniquely maps to one of the bits in the field. This produces zero collisions and reduces the space required to count each unique element to 1 bit. While bitmaps drastically reduce the space requirements from the naive set implementation described above, they are still problematic when the cardinality is very high and/or you have a very large number of different sets to count. For example, if we want to count to one billion using a bitmap you will need one billion bits, or roughly 120 megabytes for each counter. Sparse bitmaps can be compressed in order to gain space efficiency, but that isn't always helpful.
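A minimal sketch of an exact bitmap counter may make the idea concrete. It assumes the IDs have already been mapped to distinct integers in a known range `[0, universe_size)`, which stands in for the collision-free hash function described above; the class name `BitmapCounter` and the helper `bitmap_bytes` are hypothetical and not taken from any library.

```python
class BitmapCounter:
    """Exact distinct counter: one bit per possible value."""

    def __init__(self, universe_size: int) -> None:
        self.universe_size = universe_size
        # Round up to a whole number of bytes; all bits start at 0.
        self.bits = bytearray((universe_size + 7) // 8)

    def add(self, value: int) -> None:
        if not 0 <= value < self.universe_size:
            raise ValueError("value outside the bitmap's range")
        # Set the bit that belongs to this value; repeats are no-ops.
        self.bits[value >> 3] |= 1 << (value & 7)

    def cardinality(self) -> int:
        # The number of set bits is the number of distinct values seen.
        return sum(bin(byte).count("1") for byte in self.bits)


def bitmap_bytes(universe_size: int) -> int:
    """Memory needed by a bitmap that can count up to universe_size."""
    return (universe_size + 7) // 8


counter = BitmapCounter(1000)
for value in [1, 5, 5, 999, 1, 0]:
    counter.add(value)
print(counter.cardinality())
print(bitmap_bytes(10**9))
```

The bitmap's size depends on the range of possible values, not on how many were actually seen, which is why a counter sized for one billion values costs about 125 million bytes even when it is nearly empty.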
Luckily, cardinality estimation is a popular area of research. We've leveraged this research to provide an open source implementation of cardinality estimators, set membership detection, and top-k algorithms.
Cardinality estimation algorithms trade space for accuracy. To illustrate this point we counted the number of distinct words in all of Shakespeare's works using three different counting techniques. Note that our input dataset has extra data in it, so the cardinality is higher than the standard reference answer to this question. The three techniques we used were Java HashSet, Linear Probabilistic Counter, and a HyperLogLog Counter. Here are the results:
| Counter | Bytes used | Count | Error |
|---|---|---|---|
| HashSet | 10447016 | 67801 | 0% |
| Linear | 3384 | 67080 | 1% |
| HyperLogLog | 512 | 70002 | 3% |
The table shows that we can count the words with a 3% error rate using only 512 bytes of space. Compare that to a perfect count using a HashSet, which requires nearly 10 megabytes of space, and you can easily see why cardinality estimators are useful. In applications where accuracy is not paramount, which is true for most web scale and network counting scenarios, using a probabilistic counter can result in tremendous space savings.
Linear Probabilistic Counter
The Linear Probabilistic Counter is space efficient and allows the implementer to specify the desired level of accuracy. This algorithm is useful when space efficiency is important but you need to be able to control the error in your results. This algorithm works in a two-step process. The first step assigns a bitmap in memory initialized to all zeros. A hash function is then applied to each entry in the input data. The result of the hash function maps the entry to a bit in the bitmap, and that bit is set to 1. In the second step the algorithm counts the number of empty bits and uses that number as input to the following equation to get the estimate.
$$n = -m \ln V_n$$
In the equation, $m$ is the size of the bitmap and $V_n$ is the ratio of empty bits over the size of the map. The important thing to note is that the size of the original bitmap can be much smaller than the expected max cardinality. How much smaller depends on how much error you can tolerate in the result. Because the size of the bitmap, $m$, is smaller than the total number of distinct elements, there will be collisions. These collisions are required to be space-efficient but also result in the error found in the estimation. So by controlling the size of the original map we can estimate the number of collisions and therefore the amount of error we'll see in the end result.
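The two-step process can be sketched in a few lines of Python. This is an illustrative sketch, not the stream-lib implementation: the class name `LinearCounter`, the choice of SHA-1 as the hash function, and the bitmap size used in the example are all assumptions.

```python
import hashlib
import math


class LinearCounter:
    """Linear probabilistic counter over an m-bit bitmap."""

    def __init__(self, m: int) -> None:
        self.m = m
        self.bits = bytearray((m + 7) // 8)  # step 1: bitmap of all zeros

    def _index(self, item: str) -> int:
        # Assumed hash: first 8 bytes of SHA-1, reduced modulo m.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        i = self._index(item)
        self.bits[i >> 3] |= 1 << (i & 7)    # set the mapped bit to 1

    def estimate(self) -> float:
        # Step 2: count the empty bits and apply n = -m * ln(V_n).
        set_bits = sum(bin(byte).count("1") for byte in self.bits)
        empty = self.m - set_bits
        if empty == 0:
            raise ValueError("bitmap is full; choose a larger m")
        v_n = empty / self.m
        return -self.m * math.log(v_n)


counter = LinearCounter(m=8192)              # 1 KiB of bitmap
for i in range(10_000):
    counter.add(f"id-{i}")
    counter.add(f"id-{i % 100}")             # duplicates do not change the bitmap
print(round(counter.estimate()))
```

Here the bitmap has fewer bits than there are distinct items, so collisions are guaranteed, yet the estimate should still land close to 10,000. A smaller `m` saves more space at the cost of a larger error, and once every bit is set the estimate is undefined.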
HyperLogLog
The HyperLogLog Counter's name is self-descriptive. The name comes from the fact that you can estimate the cardinality of a set with cardinality Nmax using just log log(Nmax) + O(1) bits. Like the Linear Counter, the HyperLogLog Counter allows the designer to specify the desired accuracy tolerances. In HyperLogLog's case this is done by defining the desired relative standard deviation and the max cardinality you expect to count. Most counters work by taking an input data stream, M, and applying a hash function to that set, h(M). This yields an observable result of S = h(M) of {0,1}^∞ strings. HyperLogLog extends this concept by splitting the hashed input stream into m substreams and then maintaining m observables, one for each of the substreams. Taking the average of the additional observables yields a counter whose accuracy improves as m grows in size but only requires a constant number of operations to be performed on each element of the input set. The result is that, according to the authors of this paper, this counter can count one billion distinct items with an accuracy of 2% using only 1.5 kilobytes of space. Compare that to the 120 megabytes required by the HashSet implementation and the efficiency of this algorithm becomes obvious.
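The substream idea can be sketched as follows. This is a simplified illustration rather than the stream-lib code: the class name `HyperLogLog`, the precision parameter `p`, the use of SHA-1, and the one-byte-per-register layout are assumptions made for readability. The sketch follows the published algorithm, which combines the observables with a harmonic mean and has a relative standard error of roughly 1.04/sqrt(m).

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with m = 2**p one-byte registers."""

    def __init__(self, p: int = 12) -> None:
        if not 7 <= p <= 16:
            raise ValueError("this sketch supports 7 <= p <= 16")
        self.p = p
        self.m = 1 << p
        self.registers = bytearray(self.m)   # one observable per substream
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        index = h >> (64 - self.p)                   # first p bits pick the substream
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Observable: position of the leftmost 1-bit in `rest`, counted from 1.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank             # keep the maximum seen

    def estimate(self) -> float:
        harmonic_sum = sum(2.0 ** -r for r in self.registers)
        raw = self.alpha * self.m * self.m / harmonic_sum
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(p=12)                              # 4096 registers
for i in range(50_000):
    hll.add(f"id-{i}")
for i in range(10_000):
    hll.add(f"id-{i}")                               # repeats leave registers unchanged
print(round(hll.estimate()))
```

Each `add` touches exactly one register, which is the constant per-element cost mentioned above. A production implementation would pack registers into 5 or 6 bits each instead of a full byte, which is where most of the space savings come from.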
Merging Distributed Counters
We've shown that using the counters described above we can estimate the cardinality of large sets. However, what can you do if your raw input dataset does not fit on a single machine? This is exactly the problem we face at Clearspring. Our data is spread out over hundreds of servers and each server contains only a partial subset of the total dataset. This is where the fact that we can merge the contents of a set of distributed counters is crucial. The idea is a little mind-bending but if you take a moment to think about it the concept is not that much different than basic cardinality estimation. Because the counters represent the cardinality as a set of bits in a map we can take two compatible counters and merge their bits into a single map. The algorithms already handle collisions so we can still get a cardinality estimation with the desired precision even though we never brought all of the input data to a single machine. This is terribly useful and saves us a lot of time and effort moving data around our network.
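For the linear counter, merging amounts to a bitwise OR of the two bitmaps. The sketch below assumes that "compatible" means both counters use the same bitmap size and the same hash function; the function names, the constant `M`, and the use of SHA-1 are illustrative choices, not the stream-lib API.

```python
import hashlib
import math

M = 1 << 16  # bitmap size in bits; every counter to be merged must share it


def new_bitmap() -> bytearray:
    return bytearray(M // 8)


def add(bitmap: bytearray, item: str) -> None:
    # Same assumed hash on every server: first 8 bytes of SHA-1, modulo M.
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    i = int.from_bytes(digest[:8], "big") % M
    bitmap[i >> 3] |= 1 << (i & 7)


def merge(a: bytearray, b: bytearray) -> bytearray:
    if len(a) != len(b):
        raise ValueError("counters are not compatible")
    # A bit is set in the result if either server saw an item mapping to it.
    return bytearray(x | y for x, y in zip(a, b))


def estimate(bitmap: bytearray) -> float:
    empty = M - sum(bin(byte).count("1") for byte in bitmap)
    if empty == 0:
        raise ValueError("bitmap is full; choose a larger M")
    return -M * math.log(empty / M)


# Two servers see overlapping subsets of 50,000 distinct IDs.
server_a, server_b = new_bitmap(), new_bitmap()
for i in range(30_000):
    add(server_a, f"id-{i}")
for i in range(20_000, 50_000):
    add(server_b, f"id-{i}")

merged = merge(server_a, server_b)
print(round(estimate(merged)))
```

The merged bitmap is bit-for-bit identical to the one a single machine would have built from all of the data, so only the small bitmaps need to cross the network. HyperLogLog counters merge in a similar way, by taking the element-wise maximum of their registers.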
Next Steps
Hopefully this post has helped you better understand the concept and application of probabilistic counters. If estimating the cardinality of large sets is a problem and you happen to use a JVM based language then you should check out the stream-lib project. It provides implementations of the algorithms described above as well as several other stream-processing utilities.
Source: http://blog.csdn.net/hguisu/article/details/8433731
Big Data calculation: How to count 1 billion objects with only 1.5KB of memory