Big data: using bit arrays for sorting and deduplication, and a simple application of the Bloom filter

Source: Internet
Author: User

Tip One: Sorting a dataset with no duplicates

Given the dataset (2, 4, 1, 12, 9, 7, 6), how do we sort it?

Method one: use a classic sorting algorithm such as bubble sort, quicksort, or radix sort; the best time complexity among comparison sorts is O(n log n).

Method two: use a bit-array (bitmap) sort.

Most readers will think of method one immediately; method two takes a little more thought.

The maximum value in our dataset is 12, so we can allocate a bit array a of length 12, with every element initialized to 0:

a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Then we read the data. For example, when we read 2, we set the element at index 1 to 1, that is, a[2-1] = 1. The array state is now:

a = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


When we read 4, the array becomes:

a = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

When all the data has been read, the array state is:

a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1]


At this point we traverse the bit array once more and record, in order, the indices of all elements whose value is 1: 0, 1, 3, 5, 6, 8, 11, which correspond to the sorted values 1, 2, 4, 6, 7, 9, 12.

Comparing method one and method two:

Method one: the best-case time complexity is O(n log n), the lower bound for comparison sorts; the space complexity depends on the algorithm chosen, for example O(log n) stack space for quicksort or O(n) for merge sort.

Method two: both time and space complexity depend entirely on the largest value m in the dataset, i.e. O(m). When the dataset is suitable, this is a very efficient algorithm. Its limitations are also obvious: the dataset must contain no duplicates, the data should be densely distributed, the gap between the lower and upper bounds of the range cannot be too large, and the maximum value must not be too large.
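The bit-array sort above can be sketched in Python as follows. For readability this uses a plain list of 0/1 flags rather than a true packed bit array; the function name is illustrative.

```python
def bit_array_sort(data):
    """Sort a duplicate-free set of positive integers via a bit array."""
    max_value = max(data)
    bits = [0] * max_value          # one flag per possible value 1..max_value
    for value in data:
        bits[value - 1] = 1         # mark value as present: a[value-1] = 1
    # Collect indices whose flag is 1; index i corresponds to value i + 1.
    return [i + 1 for i, flag in enumerate(bits) if flag]

print(bit_array_sort([2, 4, 1, 12, 9, 7, 6]))  # [1, 2, 4, 6, 7, 9, 12]
```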


Tip Two: Finding duplicate data

Given the dataset (2, 4, 1, 12, 2, 9, 7, 6, 1, 4), how do we find the duplicate numbers?

Method one: sort the data with a sorting algorithm, then traverse it to find the elements that occur more than once.

Method two: first allocate a bit array of length 12, initialized to 0. Then iterate over the elements; for each value read, such as 2, set a[2-1] = 1, and so on. After we have traversed as far as the element 12 (that is, after reading 2, 4, 1, 12), the state of the bit array is:

a = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]

The next element is 2. When we access a[1], we find that a[1] is already 1, which tells us that 2 has occurred before; it is a duplicate.
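The duplicate check above can be sketched as follows, again with a 0/1 list standing in for a packed bit array (the function name is an illustrative choice):

```python
def find_duplicates(data):
    """Return values that occur more than once, using a bit array."""
    bits = [0] * max(data)
    duplicates = []
    for value in data:
        if bits[value - 1] == 1:    # bit already set: value seen before
            duplicates.append(value)
        else:
            bits[value - 1] = 1     # first occurrence: mark it
    return duplicates

print(find_duplicates([2, 4, 1, 12, 2, 9, 7, 6, 1, 4]))  # [2, 1, 4]
```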


Practical Use Cases

Application 1: A file contains a batch of 8-digit phone numbers. How do we count how many distinct numbers appear?

An 8-digit telephone number is at most 99,999,999, so the bit array must cover 100,000,000 values, occupying 100,000,000 / 8 = 12,500,000 B = 12,500 KB = 12.5 MB. In other words, about 12.5 MB of space is needed.
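A minimal sketch of this counting scheme, packing 8 flags per byte in a bytearray (the function name and the sample numbers are illustrative; a real run would stream the numbers from the file):

```python
def count_distinct(numbers, max_value=10**8):
    """Count distinct values using one bit per possible number."""
    bits = bytearray(max_value // 8)        # 12,500,000 bytes = 12.5 MB
    for n in numbers:
        bits[n >> 3] |= 1 << (n & 7)        # set the bit for n
    # Popcount over the whole array = number of distinct values seen.
    return sum(bin(b).count("1") for b in bits if b)

print(count_distinct([12345678, 12345678, 87654321]))  # 2
```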


Application 2: A file contains a batch of 8-digit phone numbers. How do we count the numbers that appear exactly once?

Here we need to extend the scheme: use two bits per number, where 00 means absent, 01 means seen once, 10 means seen twice, and 11 means seen three or more times. This requires about 25 MB of space.
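A sketch of the two-bit counting scheme, assuming 8-digit numbers and a saturating counter (the function name and sample data are illustrative):

```python
def count_exactly_once(numbers, max_value=10**8):
    """Count values that occur exactly once, using 2 bits per value.

    Counter states: 00 = absent, 01 = seen once, 10 = seen twice,
    11 = seen three or more times (saturating). Needs max_value / 4
    bytes, i.e. ~25 MB for 8-digit phone numbers.
    """
    counters = bytearray(max_value // 4)    # 4 two-bit counters per byte
    for n in numbers:
        byte, slot = divmod(n, 4)
        state = (counters[byte] >> (slot * 2)) & 0b11
        if state < 3:                       # saturate at 11
            counters[byte] &= 0xFF ^ (0b11 << (slot * 2))
            counters[byte] |= (state + 1) << (slot * 2)
    # A value occurred exactly once iff its counter is 01.
    return sum(1 for b in counters if b
               for slot in range(4)
               if (b >> (slot * 2)) & 0b11 == 1)

print(count_exactly_once([12345678, 12345678, 87654321, 11112222]))  # 2
```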


Application 3: There are two files; file 1 contains 100 million 10-digit QQ numbers and file 2 contains 50 million 10-digit QQ numbers. How do we find the QQ numbers that appear in both files?

Following Application 1: the largest 10-digit QQ number is 9,999,999,999, so the bit array occupies about 10^10 / 8 bytes = 1,250 MB = 1.25 GB, all initialized to 0. Read the first file and set the bit for every number that appears. Then read the second file; for each number read, if its corresponding bit is 1, that number occurs in both files.
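The two-pass intersection can be sketched as below. The "files" here are small Python lists and max_value is tiny so the example runs in a few bytes; at real scale (10-digit numbers) the bytearray would be about 1.25 GB and the inputs would be streamed from disk.

```python
def common_numbers(file1, file2, max_value):
    """Return numbers from file2 that also appear in file1."""
    bits = bytearray(max_value // 8 + 1)
    for n in file1:                         # pass 1: mark every number in file 1
        bits[n >> 3] |= 1 << (n & 7)
    # Pass 2: numbers from file 2 whose bit is already set occur in both.
    return [n for n in file2 if bits[n >> 3] & (1 << (n & 7))]

print(common_numbers([123, 456, 789], [456, 999, 123], max_value=999))  # [456, 123]
```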


Application 4: There are two files; file 1 contains 100 million 15-digit QQ numbers and file 2 contains 50 million 15-digit QQ numbers. How do we find the QQ numbers that appear in both files?

The QQ numbers have now grown to 15 digits. If we reuse the method of Application 3, the space required is about 10^15 / 8 bytes = 125,000 GB = 125 TB, which is clearly unrealistic, so we must look for another solution.


Bloom Filter

Looking back at the bit-map approach: each time we allocate a bit array whose length matches the maximum value in the data (12 in Tip One) and map each value to a bit-array index. This is really just a trivial hash function, a one-to-one mapping. In Application 4, once the QQ numbers grow to 15 digits, the bit-map is no longer practical. How can we improve it? The solution: shorten the bit array and increase the number of hash functions.

For each QQ number we apply k hash functions, obtaining k different positions. Suppose k = 3: while reading the first file, each QQ number is mapped to 3 different positions in the bit array, and those 3 bits are set to 1.



When we then read the second file with its 50 million QQ numbers, we map each number with the same 3 hash functions. If all 3 positions are 1, the number has appeared before; otherwise it has not.
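A minimal Bloom-filter sketch of this idea. The hash scheme (slicing one SHA-1 digest into k 4-byte indices), the class name, and the sizes are all illustrative assumptions, not a prescription; real implementations choose the bit-array size and k from the expected item count and target error rate.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions per key (k <= 5 here,
    since one 20-byte SHA-1 digest yields five 4-byte chunks)."""

    def __init__(self, num_bits, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        digest = hashlib.sha1(str(key).encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, key):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(key))

bf = BloomFilter(num_bits=10_000, num_hashes=3)
bf.add(123456789012345)                    # a 15-digit "QQ number"
print(bf.might_contain(123456789012345))   # True
print(bf.might_contain(999999999999999))   # False, barring a false positive
```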

This algorithm is highly efficient for large data volumes, as all operations are done in memory, avoiding the effect of disk read and write on speed.

However, this algorithm cannot guarantee 100% accuracy; it can only guarantee a controllable error rate (false positives). The exact mathematical formula for this error rate is described in articles dedicated to the Bloom filter.


