Train of thought: from simple sorting to the bitmap algorithm, then to the data de-duplication problem, and finally to a big data processing tool: the Bloom filter.
Scenario 1: Sort non-duplicate data
How do we sort a given set of distinct numbers, for example the numbers 1, 2, 4, 6, 7, 9 and 12 given in arbitrary order?
Method 1: Use a basic sorting algorithm, such as bubble sort or quicksort.
Method 2: Use the bitmap algorithm.
Method 1 is not covered here. The bitmap in Method 2 is simply a bit array. The only difference between a bitmap and an ordinary array is that a bitmap operates on individual bits.
The first step is to allocate a 2-byte bit array of length 16 (the length is determined by the largest number in the data above, 12), with every bit initialized to 0.
Then read the data: 2 is stored at subscript 1 of the bit array, changing the value there from 0 to 1; 4 is stored at subscript 3, changing the value there from 0 to 1; and so on for the rest of the data.
Finally, scan the array in order to obtain the sorted data: (1, 2, 4, 6, 7, 9, 12)
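The three steps can be sketched in Python as follows. This is only a minimal illustration, not the author's code: the function name bitmap_sort and the input order are assumptions, and for simplicity the value v is mapped to bit v directly rather than to subscript v - 1 as in the text.

```python
def bitmap_sort(values):
    """Sort distinct non-negative integers with a bit array (illustrative sketch)."""
    if not values:
        return []
    size = max(values) + 1              # one bit for each possible value 0..max
    bits = bytearray((size + 7) // 8)   # round up to whole bytes; max 12 -> 2 bytes

    # Step 2: read the data and set the bit for each value.
    for v in values:
        bits[v // 8] |= 1 << (v % 8)

    # Step 3: scan the bits in increasing order; set bits come out sorted.
    return [i for i in range(size) if bits[i // 8] & (1 << (i % 8))]


# The input order below is an assumption; only the sorted result is given in the text.
print(bitmap_sort([2, 4, 1, 12, 9, 7, 6]))
```

Note that the final scan visits every position up to the maximum value, which is why the cost depends on the largest number rather than on how many numbers there are.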
The difference between Method 1 and Method 2: in Method 2, the time and space needed for sorting both depend on the largest number in the data. With a maximum of 12, for example, 2 bytes of memory must be allocated, and the whole array must be traversed. When the data consists of only three values but the largest is around 100,000, Method 2 is clearly expensive in both time and space. The method shows its advantages when the data is dense.
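Scenario 2: Detecting duplicate data
Given a list of numbers in which some value occurs more than once (here 2 is the repeated value and 12 is the maximum), how do we identify the repeated numbers?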
The first step is to allocate a 2-byte bit array of length 16 (the length is determined by the largest number in the data above, 12).
As each number is read, the bit at its position is set to 1. By the time 12 has been read, the bit for every number read so far, including 2, is 1.
When 2 is read again and the value at its position is already 1, 2 is determined to be a duplicate.
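A minimal Python sketch of this check is shown below. The function name find_duplicates and the sample input are assumptions made for illustration; as in any plain bitmap, a value that occurs three times would be reported twice.

```python
def find_duplicates(values):
    """Report values whose bit is already set when they are read (illustrative sketch)."""
    if not values:
        return []
    bits = bytearray(max(values) // 8 + 1)   # max 12 -> 2 bytes
    duplicates = []
    for v in values:
        byte, mask = v // 8, 1 << (v % 8)
        if bits[byte] & mask:
            duplicates.append(v)             # bit already 1: seen before
        else:
            bits[byte] |= mask               # first occurrence: mark it
    return duplicates


# Hypothetical input in which 2 appears again after 12 has been read.
print(find_duplicates([2, 4, 1, 12, 2, 9, 7, 6]))
```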
Application
Application 1: A file contains 8-digit phone numbers. Count how many distinct phone numbers it contains. (Determine whether each number appears.)
The largest 8-digit number is 99,999,999, so about 100 million bits are needed, which is roughly 12.5 MB of memory. That is enough to do the count.
Application 2: A file contains 8-digit phone numbers. Count the phone numbers that appear exactly once. (Determine whether each number appears, and whether it appears only once.)
Use two bits per number: 0 means the number has not appeared, 1 means it has appeared exactly once, and 2 means it has appeared at least twice.
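The two-bit scheme could look like the sketch below, which packs four 2-bit counters into each byte. The function name appear_exactly_once and the tiny sample input are assumptions; for real 8-digit phone numbers the array would need 2 x 10^8 bits, about 25 MB.

```python
def appear_exactly_once(numbers, max_value):
    """Return the values that occur exactly once, using 2 bits per value (sketch)."""
    counters = bytearray((max_value + 4) // 4)    # four 2-bit counters per byte

    for n in numbers:
        byte, shift = n // 4, (n % 4) * 2
        state = (counters[byte] >> shift) & 0b11  # 0: unseen, 1: once, 2: twice or more
        if state < 2:
            # 0 -> 1 or 1 -> 2 never overflows the 2-bit field.
            counters[byte] += 1 << shift

    return [n for n in range(max_value + 1)
            if (counters[n // 4] >> ((n % 4) * 2)) & 0b11 == 1]


print(appear_exactly_once([5, 3, 5, 9, 3, 3, 0], max_value=9))
```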
Application 3: There are two files. File 1 contains 100 million 10-digit QQ numbers and file 2 contains 10 million 10-digit QQ numbers. Identify the QQ numbers that appear in both files.
First, create a bit array of 10^10 bits (about 1.25 GB of memory) and initialize every bit to 0. Read the first file and, for each QQ number, set the bit at the corresponding position to 1. After the first file has been read, read the second file: if the bit at a number's position is already 1, that number appears in both files.
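The two-pass procedure can be sketched as below. To keep the example small, the two "files" are plain Python lists of short made-up numbers, and the names common_numbers and max_value are assumptions; the last line only checks the memory estimate for the full 10-digit range.

```python
def common_numbers(first, second, max_value):
    """Mark every number from the first source, then test the second source (sketch)."""
    bits = bytearray(max_value // 8 + 1)          # all bits start at 0

    for n in first:                               # pass 1: set bits
        bits[n // 8] |= 1 << (n % 8)

    # Pass 2: a set bit means the number was also in the first source.
    return [n for n in second if bits[n // 8] & (1 << (n % 8))]


# Hypothetical miniature data standing in for the two files.
print(common_numbers([10001, 20002, 30003], [20002, 40004, 10001], max_value=99999))

# One bit for every 10-digit number: 10**10 bits expressed in bytes.
print(10**10 // 8)
```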
Application 4: There are two files. File 1 contains 100 million 15-digit QQ numbers and file 2 contains 10 million 15-digit QQ numbers. Identify the QQ numbers that appear in both files.
In Application 4, once QQ numbers grow to 15 digits, a bit array of 10^15 bits (about 125 TB) clearly does not fit in memory. What should we do then? Use a Bloom filter.
Bloom filter:
With a bitmap, the bit array is always sized by the largest value, such as 16 in Scenario 1, and each value is mapped to its own subscript in the bit array. This is really the simplest possible hash function. In Application 4 above, once QQ numbers have 15 digits, the bitmap is no longer practical. How can we improve it? Solution: reduce the length of the bit array, but increase the number of hash functions.
For each QQ number, apply k hash functions to obtain k positions. If k is 3, each QQ number is mapped to three positions in the bit array, and all three are set to 1.
When reading the second file, with its 10 million QQ numbers, apply the same three hash functions. If all three positions are 1, the number is taken to have appeared in the first file; otherwise it has definitely not appeared.
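A minimal Bloom filter along these lines might look like the sketch below. The text does not say which hash functions to use, so deriving the k positions from SHA-256 with k different prefixes is purely an assumption, as are the class name BloomFilter, the filter size and the sample QQ numbers.

```python
import hashlib


class BloomFilter:
    """Bit array plus k hash functions (illustrative sketch, not a tuned implementation)."""

    def __init__(self, num_bits, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: simulate k hash functions by prefixing the item with 0..k-1.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every one of the k positions is 1.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical 15-digit numbers standing in for the first file.
seen = BloomFilter(num_bits=10_000, num_hashes=3)
for qq in ["123456789012345", "987654321098765"]:
    seen.add(qq)

print("123456789012345" in seen)   # a number that was added always tests positive
```

A number that was added is always reported as present. A number that was never added is usually, but not always, reported as absent.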
Is there a problem with this?
Clearly there is: can a QQ number that does not appear in the first file still have all three of its positions set to 1? The answer is yes, but the probability of this is controllable. The error depends on the number and quality of the hash functions and on the size of the bit array, so you can control the error probability by adjusting the number of hash functions and the size of the bit array. The mathematical formulas relating these quantities are not studied in detail here.
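For readers who want a feel for the numbers, the commonly quoted approximation for the false positive rate is (1 - e^(-kn/m))^k, where m is the number of bits, n the number of stored items and k the number of hash functions. It is not taken from this article and assumes ideal, independent hash functions. The sketch below simply evaluates it; the function names and the choice of a 10^10-bit array are assumptions.

```python
import math


def false_positive_rate(num_bits, num_items, num_hashes):
    """Standard approximation (1 - e^(-k*n/m))^k, assuming ideal hash functions."""
    return (1 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def optimal_num_hashes(num_bits, num_items):
    """The k that minimises the approximation above: (m / n) * ln 2, rounded."""
    return max(1, round(num_bits / num_items * math.log(2)))


# Assumed setting: 100 million numbers, a 10**10-bit array (about 1.25 GB), k = 3.
print(false_positive_rate(10**10, 10**8, 3))
print(optimal_num_hashes(10**10, 10**8))
```

Under these assumptions the error rate with k = 3 comes out at a few in 100,000, and it shrinks as the bit array grows relative to the number of stored items.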
In summary, the Bloom filter is a further extension of the bitmap. For massive data volumes, a Bloom filter lets the membership check be done in memory, avoiding disk reads and writes, which makes it highly efficient. The above is my own understanding of the two. If there are any mistakes, please point them out.