[Repost] Bitmaps as an approach to processing massive data

Source: Internet
Author: User
Tags: bitset

I. Overview

This article describes the principles of the bit-map algorithm and some of its application scenarios, such as finding duplicates in massive data and determining whether a given element is present in massive data. It closes with a summary of where bitmaps are used in practice.

II. Bit-map algorithm

Let's look at a scenario: given an ordinary PC with 2 GB of memory, process a file containing 4 billion unsigned int integers that are non-repeating and unordered. Given an integer, can you quickly determine whether it is among the 4 billion values in the file?

Problem thinking:

Storing 4 billion ints directly takes (4 billion × 4) / 1024 / 1024 / 1024 ≈ 14.9 GB. That clearly does not fit in 2 GB of memory, so the raw data cannot be loaded into memory for processing. Yet the fastest way to answer the query is to keep the data in memory, so the question becomes how to represent 4 billion integers within 2 GB. An int in Java is 4 bytes, i.e. 32 bits. If a single bit can mark the presence of one int, the storage requirement drops dramatically: 4 billion bits take 4 billion / 8 / 1024 / 1024 ≈ 476.84 MB. (A bitmap covering every possible 32-bit unsigned value needs 2^32 bits, i.e. 512 MB, which still fits.) With this representation, all 4 billion values can be held in memory for processing.
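The arithmetic above can be checked with a small sketch. The helper names (raw_int_bytes, bitmap_bytes, to_mb) are hypothetical and only illustrate the calculation; 1 MB is taken as 1024 × 1024 bytes, as in the text.

```c
#include <stdint.h>

/* Bytes needed to store `count` values as raw 4-byte ints. */
uint64_t raw_int_bytes(uint64_t count) {
    return count * 4u;
}

/* Bytes needed for a bitmap with one bit per value in [0, range). */
uint64_t bitmap_bytes(uint64_t range) {
    return (range + 7u) / 8u;   /* round up to whole bytes */
}

/* Convert a byte count to MB, with 1 MB = 1024 * 1024 bytes. */
double to_mb(uint64_t bytes) {
    return (double)bytes / (1024.0 * 1024.0);
}
```

For 4 billion values, to_mb(raw_int_bytes(4000000000)) / 1024 comes to roughly 14.9 GB, while to_mb(bitmap_bytes(4000000000)) comes to roughly 476.84 MB.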

Specific ideas:

One int is 4 bytes, i.e. 4 × 8 = 32 bits, so we only need to allocate an int array int tmp[1 + N/32] to store the data, where N is the upper bound of the values to be looked up. Each element of tmp has 32 bits and can therefore represent 32 consecutive decimal numbers, for example 0~31. This gives the following bitmap table:

tmp[0]: can represent 0~31

tmp[1]: can represent 32~63

tmp[2]: can represent 64~95

......

Next, let's see how a decimal number is mapped to its bit.

Assuming the 4 billion ints are 6, 3, 8, 32, 36, ......, the bitmap looks like this:

tmp[0]: bits 3, 6 and 8 are set (for the integers 3, 6 and 8)

tmp[1]: bits 0 and 4 are set (for the integers 32 and 36)

How do we determine the subscript of an int in the tmp array? Divide it by 32 and take the integer part. For example, 8 divided by 32 rounds down to 0, so 8 lives in tmp[0]. And how do we know which of the 32 bits of tmp[0] represents 8? Take the number mod 32. Since 8 mod 32 equals 8, the integer 8 is represented by bit 8 of tmp[0] (counting from the right, starting at 0).
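The two calculations just described can be sketched as a pair of helpers. The names word_index and bit_offset are hypothetical; the sketch assumes 32-bit array elements, as in the text.

```c
#include <stdint.h>

#define BITS_PER_WORD 32u

/* Subscript of the array element that holds the bit for value x. */
uint32_t word_index(uint32_t x) {
    return x / BITS_PER_WORD;
}

/* Position of x's bit inside that element, counted from the right, starting at 0. */
uint32_t bit_offset(uint32_t x) {
    return x % BITS_PER_WORD;
}
```

For the integer 8 this gives subscript 0 and bit 8; for 36 it gives subscript 1 and bit 4.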

III. Original implementation of the Bit-map algorithm

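Note: this part comes from section 5 of the blog post at http://blog.csdn.net/hguisu/article/details/7880288. Let's take a look at the C implementation:

    #include <stdio.h>

    #define SHIFT 5
    #define MASK  0x1F
    #define N     10000000   /* assumed upper bound on input values; not given in the listing */

    unsigned int a[1 + N / 32];

    /* set: set the bit for i to 1 */
    void set(int i) { a[i >> SHIFT] |= (1u << (i & MASK)); }

    /* clr: clear the bit for i to 0 (main calls it for every i to initialize the bitmap) */
    void clr(int i) { a[i >> SHIFT] &= ~(1u << (i & MASK)); }

    /* test: check whether the bit for i is 1 */
    int test(int i) { return (a[i >> SHIFT] & (1u << (i & MASK))) != 0; }

    int main(void)
    {
        int i;
        for (i = 0; i < N; i++)
            clr(i);
        while (scanf("%d", &i) == 1)
            if (i >= 0 && i < N)   /* ignore values outside the bitmap */
                set(i);
        for (i = 0; i < N; i++)
            if (test(i))
                printf("%d\n", i);
        return 0;
    }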

Note: shifting left by n bits multiplies by 2^n, and shifting right by n bits divides by 2^n (for non-negative values).

Let's parse void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); } from this example.
1) i>>SHIFT:
Here SHIFT=5, so i is shifted right by 5 bits. Since 2^5=32, this is equivalent to i/32, which gives the subscript in array a for the decimal number i. For example, with i=20, i>>SHIFT = 20>>5 = 0, so i=20 falls at subscript 0.

2) i & MASK:
Here MASK=0x1F, which is 31 in decimal and 0001 1111 in binary. i & (0001 1111) keeps the lowest 5 bits of i.

For example, i=23, whose binary form is 0001 0111:
0001 0111
& 0001 1111 = 0001 0111, which is 23 in decimal
For example, i=83, whose binary form is 0000 0000 0101 0011:
0000 0000 0101 0011
& 0000 0000 0001 1111 = 0000 0000 0001 0011, which is 19 in decimal

i & MASK is equivalent to i%32 (for non-negative i).

3) 1<<(i & MASK)
This shifts 1 left by (i & MASK) bits.
For example, if (i & MASK) = 20, then 1<<20 is:
0000 0000 0000 0000 0000 0000 0000 0001 << 20
= 0000 0000 0001 0000 0000 0000 0000 0000

Note the "|=" above.

The blog post "Bit operators and their applications" mentions these bit-manipulation idioms:

To clear bit k of an int variable a to 0: a = a & ~(1<<k)
To set bit k of an int variable a to 1: a = a | (1<<k)

Here, a[i/32] |= (1<<m) sets bit m of a[i/32] to 1.


4) void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); } is equivalent to:

    void set(int i)
    {
        a[i/32] |= (1 << (i%32));
    }
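The equivalences in points 1) to 4) can be written side by side as a minimal sketch. SHIFT and MASK are the values from the text; the function names are hypothetical, and unsigned types are assumed so that shifting and masking are well defined for every input.

```c
#include <stdint.h>

#define SHIFT 5
#define MASK  0x1Fu

/* Shift/mask forms used in the listing. */
uint32_t index_by_shift(uint32_t i) { return i >> SHIFT; }
uint32_t offset_by_mask(uint32_t i) { return i & MASK; }

/* Division/modulo forms they are equivalent to for unsigned values. */
uint32_t index_by_div(uint32_t i)  { return i / 32u; }
uint32_t offset_by_mod(uint32_t i) { return i % 32u; }

/* Return `word` with bit k (0..31) set to 1. */
uint32_t set_bit(uint32_t word, unsigned k) {
    return word | (UINT32_C(1) << k);
}

/* Return `word` with bit k (0..31) cleared to 0. */
uint32_t clear_bit(uint32_t word, unsigned k) {
    return word & ~(UINT32_C(1) << k);
}
```

For i=83, both index forms give 2 and both offset forms give 19, and set_bit(0, 20) yields 0x00100000, matching the worked examples.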

This implements the three steps described above:

1. Find the subscript in array a for a decimal number n in the range 0-N: n/32

2. Find which of the bits 0-31 the number n corresponds to: n%32 = m

3. Shift 1 left by m (0-31) and OR it into the element to set the corresponding bit to 1: a[n/32] |= 1<<m
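As a sketch of how these three steps answer the membership question from section II, the bitmap can be allocated at run time and queried per value. The Bitmap type and the bitmap_* functions are hypothetical names for illustration, not part of the listing above.

```c
#include <stdint.h>
#include <stdlib.h>

/* A bitmap covering the values 0 .. nbits-1, one bit per value. */
typedef struct {
    uint32_t *words;
    uint64_t  nbits;
} Bitmap;

/* Allocate a zeroed bitmap; returns 0 on success, -1 on allocation failure. */
int bitmap_init(Bitmap *bm, uint64_t nbits) {
    bm->nbits = nbits;
    bm->words = calloc((size_t)(nbits / 32u + 1u), sizeof(uint32_t));
    return bm->words ? 0 : -1;
}

/* Steps 1-3: locate the element (x / 32), the bit (x % 32), then OR in 1 << bit. */
void bitmap_add(Bitmap *bm, uint32_t x) {
    if (x < bm->nbits)   /* values outside the covered range are ignored */
        bm->words[x / 32u] |= UINT32_C(1) << (x % 32u);
}

/* Returns 1 if x was added earlier, 0 otherwise. */
int bitmap_contains(const Bitmap *bm, uint32_t x) {
    if (x >= bm->nbits)
        return 0;
    return (int)((bm->words[x / 32u] >> (x % 32u)) & 1u);
}

/* Release the storage. */
void bitmap_free(Bitmap *bm) {
    free(bm->words);
    bm->words = NULL;
    bm->nbits = 0;
}
```

To cover every unsigned 32-bit value, nbits would be 2^32, which allocates about 512 MB. After one pass that calls bitmap_add for each integer in the file, every query is a single bitmap_contains call.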

IV. Other application scenarios of the bitmap algorithm

(1) A small variant of the bitmap: 2-bitmap.

Consider a small scenario: find the integers that are not duplicated among 300 million integers, with memory too limited to hold all 300 million integers.

This scenario can be solved with a 2-bitmap: allocate 2 bits to each integer and use the combinations of 0 and 1 to encode its state. For example, 00 means the integer has not appeared, 01 means it has appeared once, and 11 means it has appeared more than once. This separates the duplicated integers from the non-duplicated ones. The memory required is twice that of an ordinary bitmap: 300 million × 2 / 8 / 1024 / 1024 ≈ 71.5 MB (assuming the values fall within a range of 300 million).

The specific process is as follows:

Scan the 300 million integers and build the bitmap. For each integer, look at its position in the bitmap: 00 becomes 01, 01 becomes 11, and 11 stays unchanged. Once all 300 million integers have been scanned, the bitmap is complete. Finally, walk the bitmap: positions holding 11 correspond to duplicated integers, and positions holding 01 correspond to the integers that appear exactly once.
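The state transitions described above can be sketched as follows. The names two_bit_new, two_bit_get and two_bit_add are hypothetical, and the layout (16 values packed into each 32-bit element) is one possible choice, not something the text prescribes.

```c
#include <stdint.h>
#include <stdlib.h>

/* Two bits per value: 00 = never seen, 01 = seen once, 11 = seen more than once. */
enum { SEEN_NONE = 0, SEEN_ONCE = 1, SEEN_MANY = 3 };

/* Allocate a zeroed 2-bit map for values 0 .. range-1; returns NULL on failure. */
uint32_t *two_bit_new(uint64_t range) {
    return calloc((size_t)(range / 16u + 1u), sizeof(uint32_t));
}

/* Read the 2-bit state of value x. */
unsigned two_bit_get(const uint32_t *map, uint32_t x) {
    return (map[x / 16u] >> ((x % 16u) * 2u)) & 3u;
}

/* Record one occurrence of x: 00 -> 01, 01 -> 11, 11 stays 11. */
void two_bit_add(uint32_t *map, uint32_t x) {
    unsigned shift = (x % 16u) * 2u;
    unsigned state = two_bit_get(map, x);
    unsigned next  = (state == SEEN_NONE) ? SEEN_ONCE : SEEN_MANY;
    map[x / 16u] = (map[x / 16u] & ~(UINT32_C(3) << shift))
                 | ((uint32_t)next << shift);
}
```

After calling two_bit_add for every input value, a final loop over the range reports the values whose state is SEEN_ONCE as non-duplicated and those whose state is SEEN_MANY as duplicated. The caller is assumed to pass only values below the range given to two_bit_new.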

(2) Sorting integers that contain no duplicates.

Bitmaps have a natural advantage when sorting non-repeating integers: scan the integers once to build the bitmap, then traverse the bits in order, and the sorted result falls out directly.

For example: sort the integers 4, 3, 1, 7, 6

The bitmap is as follows:

Bit index: 0 1 2 3 4 5 6 7
Bit value: 0 1 0 1 1 0 1 1

Outputting the indexes of the set bits in order gives the sorted result: 1, 3, 4, 6, 7.
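A minimal sketch of this two-pass sort follows. The function name bitmap_sort is hypothetical, and the sketch assumes the caller guarantees that the values are distinct and all smaller than range.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Sort n distinct values, each smaller than `range`, in place.
 * Returns 0 on success, -1 if the bitmap cannot be allocated.
 */
int bitmap_sort(uint32_t *values, size_t n, uint32_t range) {
    uint32_t *bits = calloc((size_t)range / 32u + 1u, sizeof(uint32_t));
    if (bits == NULL)
        return -1;

    /* Pass 1: set one bit per input value. */
    for (size_t i = 0; i < n; i++)
        bits[values[i] / 32u] |= UINT32_C(1) << (values[i] % 32u);

    /* Pass 2: walk the bits in ascending order and write the values back. */
    size_t out = 0;
    for (uint32_t v = 0; v < range && out < n; v++)
        if ((bits[v / 32u] >> (v % 32u)) & 1u)
            values[out++] = v;

    free(bits);
    return 0;
}
```

Calling bitmap_sort on the array {4, 3, 1, 7, 6} with range 8 rewrites it as {1, 3, 4, 6, 7}. The running time depends on the size of the value range as well as on the number of values, so the approach pays off when the values are dense within the range.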

V. Summary

This article has described the concepts behind the bitmap algorithm along with some application scenarios and implementations. Bitmaps have many more applications than those covered here: they can also be used for compression, for URL deduplication in crawler systems, and for solving all-combinations problems. Some may find the bitmap algorithm a bit cumbersome to implement, but some languages already encapsulate it. Java, for example, provides the BitSet class as its bitmap data structure. It is fairly simple to use and the API documentation is enough to get started, but here is an example anyway:

    import java.util.BitSet;

    public class Test {
        public static void main(String[] args) {
            int[] array = new int[] {1, 2, 3, 22, 0, 3};
            BitSet bitSet = new BitSet(6);
            // Build the bitmap from the array contents
            for (int i = 0; i < array.length; i++) {
                bitSet.set(array[i], true);
            }
            // size() reports the bits of storage in use, not the number of set bits
            System.out.println(bitSet.size());
            // true, because 3 is in the array
            System.out.println(bitSet.get(3));
        }
    }

If an integer is present, its bit is set and bitSet.get(x) returns true; otherwise it returns false. Here x is the position (subscript) in the bitmap.

That wraps up bitmaps. The next post covers the hash algorithm, the "cure-all" of massive data processing, and its application in the MapReduce framework.

Reposted from http://zengzhaozheng.blog.51cto.com/8219051/1404108
