Bitmaps for massive data processing

Source: Internet
Author: User
Tags: bitset

Consider this scenario: an ordinary PC with 2 GB of memory must process a file of 4 billion unsigned int values, which are unsorted and may contain duplicates. Given an integer, how can we quickly determine whether it is among the 4 billion values in the file?

Thinking about the problem:

4 billion int values occupy 4 billion × 4 bytes / 1024 / 1024 / 1024 ≈ 14.9 GB. With only 2 GB of memory they clearly do not fit, so the data cannot be loaded into memory as plain integers. Yet the fastest way to answer the query is to keep the data in memory, so the question becomes how to store 4 billion integers in 2 GB of memory. An int in Java is 4 bytes, that is 32 bits. If a single bit is used to mark the presence of one int, the storage shrinks dramatically: 4 billion bits take 4 billion / 8 / 1024 / 1024 ≈ 476.83 MB (covering every possible 32-bit value takes 2^32 bits, exactly 512 MB). Either way, the whole data set can simply be held in memory for processing.

The approach (the bitmap idea):

One int is 4 bytes, or 4 × 8 = 32 bits, so we only need to allocate an int array int tmp[1 + N/32] to hold the data, where N is the largest value to be stored. Each element of tmp can record the presence of 32 consecutive decimal numbers, so the mapping from array elements to values is:

tmp[0]: can represent 0~31

tmp[1]: can represent 32~63

tmp[2]: can represent 64~95

......

Next, let's see how a decimal number is mapped to its bit:

Suppose the 4 billion int values are 6, 3, 8, 32, 36, ...; each one is placed in the bitmap as follows:

(1) Which element of tmp holds the number: divide the number by 32 and keep the integer part (x/32). For example, 8 divided by 32 rounds down to 0, so 8 is stored in tmp[0].

(2) Which of the 32 bits of that element holds the number: take the number modulo 32 (x%32). In the example above, 8 % 32 = 8, so the integer 8 is stored in bit 8 of tmp[0], counting from the right and starting at 0.
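The two rules above can be tried out with a small Java sketch. The class and method names (BitmapIndex, wordIndex, bitOffset) are made up for illustration; only the x/32 and x%32 arithmetic comes from the text, and the values are assumed to be non-negative.

```java
// Hypothetical helper: names are illustrative, the arithmetic is the x/32 and x%32 rule.
class BitmapIndex {
    static final int BITS_PER_WORD = 32;

    /** Index of the element of tmp[] that holds the bit for x (x must be non-negative). */
    static int wordIndex(int x) {
        return x / BITS_PER_WORD;
    }

    /** Zero-based position of x's bit inside its element, counted from the right. */
    static int bitOffset(int x) {
        return x % BITS_PER_WORD;
    }

    public static void main(String[] args) {
        int[] sample = {6, 3, 8, 32, 36};
        for (int x : sample) {
            System.out.println(x + " -> tmp[" + wordIndex(x) + "], bit " + bitOffset(x));
        }
    }
}
```

With the sample data, 6, 3 and 8 land in tmp[0], while 32 and 36 land in tmp[1] at bits 0 and 4.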

I. What is a bitmap

A bitmap uses one bit to mark the presence of an element. To see how the marking works, consider the array (1, 2, 5, 8, 10), which would normally be declared as:

int[] array = {1, 2, 5, 8, 10};

This declaration takes 4 × 5 = 20 bytes. With so little data the cost is hardly noticeable, but an array of 1,000,000,000 elements stored this way occupies about 4 GB of memory.

With a bitmap, the same data can be laid out as:

byte[] bytes = new byte[2];

bytes[0] = 0b01100100; // written directly in binary; the leftmost bit stands for 0

bytes[1] = (byte) 0b10100000; // the leftmost bit stands for 8

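For example, using a bit vector to represent the data 1, 3, 6, 100:

    import java.util.BitSet;

    class BitSetDemo {
        public static void main(String[] args) {
            // Mark the values 1, 3, 6 and 100 as present.
            BitSet bitSet = new BitSet();
            bitSet.set(1, true);
            bitSet.set(3, true);
            bitSet.set(6, true);
            bitSet.set(100, true);
            // size() is the number of allocated bits, so every allocated bit is checked.
            for (int i = 0; i < bitSet.size(); i++) {
                boolean b = bitSet.get(i);
                if (b) {
                    System.out.println(i);
                }
            }
        }
    }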

II. Building a bitmap

1. Allocate a fixed-length array

A bitmap is declared as a fixed-length byte or int array, and every bit of every element is initially reset to 0.

2. Traverse the data and insert it into the bitmap

In the example above, the elements of the array {1, 2, 5, 8, 10} are traversed and each one is inserted into the bitmap. A bitmap is hashing taken to the extreme: the key is array[i]/8, which selects the byte, and the value is array[i]%8, which selects the bit position within that byte. In practice, the hash function may be tuned differently for efficiency.

After all the data has been traversed and inserted, the two bytes hold 01100100 and 10100000.
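A minimal sketch of this insertion step follows. It assumes the layout used in the byte example earlier, where the leftmost bit of each byte stands for the smallest value; the class and method names (ByteBitmap, build, toBinary) are hypothetical.

```java
// Hypothetical sketch: byte index = v / 8, bit position = v % 8 counted from the left.
class ByteBitmap {
    /** Builds a bitmap for non-negative values no larger than maxValue. */
    static byte[] build(int[] values, int maxValue) {
        byte[] bytes = new byte[maxValue / 8 + 1]; // Java zero-initialises the array
        for (int v : values) {
            bytes[v / 8] |= (byte) (0x80 >>> (v % 8));
        }
        return bytes;
    }

    /** Renders one byte as 8 binary digits. */
    static String toBinary(byte b) {
        String s = Integer.toBinaryString(b & 0xFF);
        return "00000000".substring(s.length()) + s;
    }

    public static void main(String[] args) {
        byte[] bytes = build(new int[] {1, 2, 5, 8, 10}, 10);
        for (byte b : bytes) {
            System.out.println(toBinary(b));
        }
    }
}
```

For the array {1, 2, 5, 8, 10} this prints 01100100 and 10100000. Counting bit positions from the right instead would work equally well, as long as insertion and lookup agree.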

III. The basic idea of a bitmap

Let's look at a concrete example. Suppose we want to sort 5 elements (4, 7, 2, 5, 3), all in the range 0-7 and with no duplicates. The bitmap method can do the sorting. Representing 8 numbers takes only 8 bits (1 byte), so we first allocate 1 byte and set all of its bits to 0:

0 0 0 0 0 0 0 0

Then traverse the 5 elements. The first element is 4, so the bit corresponding to 4 is set to 1. (This can be done with p[i/8] |= (0x01 << (i%8)). The operation involves big-endian versus little-endian bit ordering; big-endian is assumed here.) Because positions are zero-based, the fifth position is set to 1:

0 0 0 0 1 0 0 0

Next the second element, 7, is processed and the eighth position is set to 1; then the third element, and so on until all elements have been processed and their positions set to 1. The bits in memory then look like this:

0 0 1 1 1 1 0 1

Now we traverse the bit area and output the position of every bit that is 1 (2, 3, 4, 5, 7), which yields the elements in sorted order.
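The sorting walk-through can be sketched in Java as follows. This is an illustration under the stated assumptions (distinct, non-negative values with a known upper bound); BitmapSort and its sort method are made-up names, and the bit order inside each byte is an implementation detail that does not affect the result.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of bitmap sorting for distinct non-negative ints.
class BitmapSort {
    /** Marks each value's bit, then scans the bits in ascending order. */
    static List<Integer> sort(int[] values, int maxValue) {
        byte[] bits = new byte[maxValue / 8 + 1];
        for (int v : values) {
            bits[v / 8] |= (byte) (1 << (v % 8));
        }
        List<Integer> sorted = new ArrayList<>();
        for (int v = 0; v <= maxValue; v++) {
            if ((bits[v / 8] & (1 << (v % 8))) != 0) {
                sorted.add(v);
            }
        }
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sort(new int[] {4, 7, 2, 5, 3}, 7)); // [2, 3, 4, 5, 7]
    }
}
```

No element is ever compared with another; the cost is one pass over the input plus one pass over the value range.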

Advantages: 1. High efficiency: no comparisons or element moves are needed;

2. Low memory use: for example, with n = 10,000,000 only n/8 = 1,250,000 bytes ≈ 1.25 MB are needed.

Disadvantages:

The data must not contain duplicates. A single bit records only whether a value is present, so data with duplicates cannot be sorted or counted faithfully.

The algorithm itself is simple; the key is how to map a decimal number to its bit in the bitmap.

IV. Mapping table

Suppose there are N = 10,000,000 numbers to sort or search, none larger than N. Then we need to allocate int a[1 + N/32], where a[0] occupies 32 bits in memory and corresponds to the decimal numbers 0-31, and so on:
The bitmap table is:
a[0]--------->0-31
a[1]--------->32-63
a[2]--------->64-95
a[3]--------->96-127
..........
How, then, is a decimal number converted to its bit? The following describes how to do the conversion with shift operations.

Shift conversion

A one-dimensional int array can be viewed as a two-dimensional bit array with 32 columns:

         |             32 bits            |

int a[0] |00000000000000000000000000000000|

int a[1] |00000000000000000000000000000000|

..................

int a[N] |00000000000000000000000000000000|

For example, decimal 0 corresponds to the first bit of a[0]: 00000000000000000000000000000001
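The shift-based conversion can be sketched as below. It relies on the fact that, for non-negative x, x >> 5 equals x / 32 and x & 31 equals x % 32. The class IntBitmap and its method names are hypothetical, not taken from any library.

```java
// Hypothetical sketch: set, clear and test bits in an int[] using shifts and masks.
class IntBitmap {
    private final int[] a;

    /** Allocates room for the values 0..n, as in int a[1 + N/32]. */
    IntBitmap(int n) {
        a = new int[1 + n / 32];
    }

    void set(int x) {
        a[x >> 5] |= 1 << (x & 31);     // x >> 5 picks the element, x & 31 picks the bit
    }

    void clear(int x) {
        a[x >> 5] &= ~(1 << (x & 31));
    }

    boolean get(int x) {
        return (a[x >> 5] & (1 << (x & 31))) != 0;
    }

    /** Element i rendered as 32 binary digits, most significant bit on the left. */
    String wordToBinary(int i) {
        return String.format("%32s", Integer.toBinaryString(a[i])).replace(' ', '0');
    }

    public static void main(String[] args) {
        IntBitmap map = new IntBitmap(10_000_000);
        map.set(0);
        System.out.println(map.wordToBinary(0)); // 00000000000000000000000000000001
    }
}
```

Shifts and masks are used here only because they mirror the text; a modern compiler is likely to produce similar code for / 32 and % 32 on non-negative values.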

V. Extended bitmap application scenarios

Once the bitmap has been built, it is convenient to use. In general, a bitmap can be used for searching, deduplication, sorting and similar operations.

Take the storage problem mentioned above: 1,000,000,000 values stored as integers consume 4 GB of memory, while a bitmap consumes 125 MB. In practice, however, the method does not suit data whose maximum and minimum are far apart: a set of only three numbers whose largest is 99999 still needs a bitmap spanning the whole range.

Searching and deduplication are easy to understand. Sorting works somewhat like bucket sort, with each bit acting as a bucket.

1. Find the duplicated integers among 300 million integers, when memory is not large enough to hold all 300 million integers

This scenario can be solved with a 2-bit bitmap: allocate 2 bits for each integer and use the combinations of 0 and 1 to carry meaning. For example, 00 means the integer has not appeared, 01 means it has appeared once, and 11 means it has appeared more than once. This identifies the duplicated integers. The required memory is twice that of an ordinary bitmap: 300 million × 2 / 8 / 1024 / 1024 ≈ 71.5 MB.

The procedure is as follows: scan the 300 million integers and build the bitmap. For each integer, look at its position in the bitmap: 00 becomes 01, 01 becomes 11, and 11 stays unchanged. When all 300 million integers have been scanned, the bitmap is complete. Finally, traverse the bitmap and output every integer whose bits are 11.
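The scan-and-update procedure can be sketched as follows. This is one possible packing, assuming non-negative values with a known upper bound: each long holds 32 two-bit slots. The names TwoBitMap, add, state and duplicates are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a 2-bit bitmap: 00 = absent, 01 = seen once, 11 = seen more than once.
class TwoBitMap {
    private final long[] words; // 32 two-bit slots per long

    TwoBitMap(int maxValue) {
        words = new long[maxValue / 32 + 1];
    }

    /** Current 2-bit state of x. */
    int state(int x) {
        return (int) ((words[x / 32] >>> ((x % 32) * 2)) & 0b11L);
    }

    /** Applies the transition 00 -> 01 -> 11; 11 stays 11. */
    void add(int x) {
        int shift = (x % 32) * 2;
        int s = state(x);
        if (s == 0b00) {
            words[x / 32] |= 0b01L << shift;
        } else if (s == 0b01) {
            words[x / 32] |= 0b11L << shift;
        }
    }

    /** Returns, in ascending order, every value that occurs more than once. */
    static List<Integer> duplicates(int[] values, int maxValue) {
        TwoBitMap map = new TwoBitMap(maxValue);
        for (int v : values) {
            map.add(v);
        }
        List<Integer> result = new ArrayList<>();
        for (int v = 0; v <= maxValue; v++) {
            if (map.state(v) == 0b11) {
                result.add(v);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(duplicates(new int[] {5, 9, 5, 1, 9, 9, 70}, 100)); // [5, 9]
    }
}
```

For real data the input would be streamed from the file rather than passed as an array; the array is used here only to keep the sketch self-contained.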

2. Sort integers with no repeating elements

A bitmap has a natural advantage when sorting non-repeating integers: scan the integers once to build the bitmap, then traverse the bit area in order to obtain the sorted result.

For example, to sort the integers 4, 3, 1, 7, 6, set bits 1, 3, 4, 6 and 7 of one byte:

0 1 0 1 1 0 1 1

Then output the positions of the set bits in order to get the sorted result: 1, 3, 4, 6, 7.

3. A file contains telephone numbers of 8 digits each; count how many different numbers there are

An 8-digit number is at most 99,999,999, so about 100 million bits are needed, roughly 12 MB of memory. Think of the numbers 0 to 99,999,999 as each owning one bit: 100 million bits = 12,500,000 bytes, so about 12 MB of memory can represent every 8-digit telephone number.
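A minimal sketch of the counting step, using java.util.BitSet. It assumes the telephone numbers have already been parsed into int values between 0 and 99,999,999; reading the file is left out, and the class name PhoneNumberCounter is made up for illustration.

```java
import java.util.BitSet;

// Hypothetical sketch: one bit per possible 8-digit number.
class PhoneNumberCounter {
    static final int RANGE = 100_000_000; // the numbers 00000000 .. 99999999

    /** Counts how many different numbers occur in the input. */
    static int countDistinct(int[] numbers) {
        BitSet seen = new BitSet(RANGE);
        for (int n : numbers) {
            seen.set(n); // setting an already-set bit changes nothing
        }
        return seen.cardinality(); // number of bits set to true
    }

    public static void main(String[] args) {
        int[] numbers = {12345678, 87654321, 12345678, 99999999, 0};
        System.out.println(countDistinct(numbers)); // 4
    }
}
```

The BitSet occupies about 12 MB no matter how many numbers the file contains, which is the point of the technique.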

4. Find the integers that are not repeated among 250 million integers, when memory is not large enough to hold all 250 million integers

Extend the bitmap to use 2 bits per number: 0 means the number has not appeared, 1 means it has appeared once, and 2 means it has appeared twice or more, i.e. it is repeated. While scanning, if the value at the corresponding position is 0, set it to 1; if it is 1, set it to 2; if it is 2, leave it unchanged. Alternatively, instead of a 2-bit representation, two ordinary bitmaps can simulate the 2-bit map; the principle is the same.
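The two-bitmap variant can be sketched with two java.util.BitSet instances, one per bit of the state. The sketch below counts the values left in state 1 (seen exactly once); counting every non-zero state instead, that is seen.cardinality(), would give the number of different values. The name OnceCounter is hypothetical and the values are assumed to be non-negative.

```java
import java.util.BitSet;

// Hypothetical sketch: two bitmaps simulate the states 0 (absent), 1 (once), 2 (repeated).
class OnceCounter {
    static int countAppearingOnce(int[] values) {
        BitSet seen = new BitSet();     // set when a value has appeared at least once
        BitSet repeated = new BitSet(); // set when a value has appeared twice or more
        for (int v : values) {
            if (!seen.get(v)) {
                seen.set(v);            // 0 -> 1
            } else {
                repeated.set(v);        // 1 -> 2, and 2 stays 2
            }
        }
        BitSet once = (BitSet) seen.clone();
        once.andNot(repeated);          // keep only the values still in state 1
        return once.cardinality();
    }

    public static void main(String[] args) {
        System.out.println(countAppearingOnce(new int[] {4, 7, 4, 2, 9, 9, 9})); // 2
    }
}
```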

For the use of bitmap see also: http://my.oschina.net/cloudcoder/blog/294810?fromerr=62qBkJF5

http://blog.csdn.net/hguisu/article/details/7880288

Note: bitSet.size() returns the number of bits of space actually in use by the BitSet to represent bit values; it is an integer multiple of 64.

new BitSet(950) does not create a BitSet of exactly 950 bits; it only guarantees that the initial capacity can hold at least 950 bits. The size is managed by the implementation and is always a multiple of 64: even for new BitSet(1), the size is 64.
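The difference between the requested capacity and the reported size can be checked with a short sketch. The exact numbers printed by size() depend on the JDK implementation, so the comments state only what the note above describes; BitSetSizeDemo and allocatedBits are made-up names.

```java
import java.util.BitSet;

// Hypothetical sketch comparing size(), length() and cardinality().
class BitSetSizeDemo {
    /** Number of bits actually allocated for a BitSet created with the given capacity. */
    static int allocatedBits(int nbits) {
        return new BitSet(nbits).size();
    }

    public static void main(String[] args) {
        System.out.println(allocatedBits(1));   // a multiple of 64
        System.out.println(allocatedBits(950)); // at least 950, and a multiple of 64

        BitSet bits = new BitSet(950);
        bits.set(100);
        System.out.println(bits.length());      // index of the highest set bit plus one: 101
        System.out.println(bits.cardinality()); // number of set bits: 1
    }
}
```

This is why a loop bounded by size() visits bits that were never requested; length() is the tighter bound when only the set bits matter.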

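When several values can map to the same bit (for example, when values are hashed into a smaller bit range and collisions overwrite one another), a bitmap guarantees only this: "if the result is false, the data definitely does not exist; if the result is true, the data may or may not exist", i.e. false == definitely absent; true == maybe present. When every value has its own bit, as in the examples above, a true result is exact as well.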
