Analysis and application of Java Bitmap/bitvector

Transferred from: http://shmilyaw-hotmail-com.iteye.com/blog/1741608

Brief introduction

Bitmaps are used in many large-scale data processing scenarios. Typical uses include filtering data, marking items with flag bits, and counting. They are usually introduced when the data set is so large that an ordinary array can no longer hold it. A bitmap does not fundamentally solve the problem of processing massive data, but within a certain range of data it is an effective method. The Java class library has a corresponding implementation: BitSet. We first introduce the bitmap, then analyze in detail a neat implementation of a bit vector, and finally compare it with the BitSet implementation in Java. This article makes no distinction between bitmap and bitvector; they mean the same thing.

The derivation of bitmap

Suppose we have a large collection of data, such as a set of numbers, stored in a large file. It contains 40 billion numbers in total. Many of them are duplicates; once the duplicates are removed, roughly 4 billion remain. Now suppose we have a machine with 2 GB of memory. How do we eliminate the duplicated elements? Going one step further, once the duplicates are eliminated, how do we count the remaining elements and save the deduplicated elements into another result file?

Let's make a rough estimate first. Assume the numbers range from 0 to Integer.MAX_VALUE. Is it feasible to allocate an array to hold them? An int takes 4 bytes, and covering 0 to Integer.MAX_VALUE takes 2^31 elements, that is, 2G elements. Multiplying the two gives 8 GB, so unless the machine has more than 8 GB of memory, it cannot hold that much data at all.
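
The estimate above can be checked with a few lines of arithmetic. The sketch below is only illustrative; the class name MemoryEstimate and its methods are made up for this example. It also computes, for comparison, what one bit per value would cost, which is the idea the next section develops.

```java
// Rough memory estimate for tracking every value in 0..Integer.MAX_VALUE.
// MemoryEstimate is a hypothetical helper, not part of any library.
class MemoryEstimate {
    static final long VALUES = 1L << 31;   // 2^31 distinct non-negative ints

    // One 4-byte int per value.
    static long intArrayBytes() {
        return VALUES * Integer.BYTES;
    }

    // One bit per value, 8 values per byte.
    static long bitmapBytes() {
        return VALUES / 8;
    }

    public static void main(String[] args) {
        System.out.println("int[]:  " + intArrayBytes() / (1L << 30) + " GB");
        System.out.println("bitmap: " + bitmapBytes() / (1L << 20) + " MB");
    }
}
```

With one bit per value the same range fits in 256 MB, which is well within the 2 GB machine described earlier.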

Bitmap Analysis and application

Now, what if we try a different approach and use a bitmap? A bitmap is essentially an array in which each bit, rather than each element, represents one number. Suppose we use a byte array. Counting bits from 0, the number 1 corresponds to bit 1 of the first element of the array. The number 9 falls outside the 8 bits of the first element, so it corresponds to bit 1 of the second element. In this way we can map the 4 billion elements into the byte array, with each number corresponding to exactly one bit.

Suppose a[0] is the first byte of the array. It has 8 bits, which represent the numbers 0 through 7; the next byte, a[1], represents 8 through 15, and so on.

From this discussion we can easily derive the relationship between a number and the exact bit that stores it. Suppose there is a number i. The index of the element that holds it is i / 8, so if the array is a, the element is a[i / 8]. Which bit of a[i / 8] does it correspond to? It is bit i % 8 of that element.
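
As a small illustration of this mapping, the following sketch computes the byte index and the bit offset for a few numbers. The class and method names (BitAddress, byteIndex, bitOffset) are invented for this example.

```java
// Maps a non-negative number to its position in a byte-backed bitmap.
// BitAddress is a hypothetical helper used only for illustration.
class BitAddress {
    // Index of the byte that holds the bit for number i.
    static int byteIndex(int i) {
        return i / 8;
    }

    // Position of the bit inside that byte, counted from 0.
    static int bitOffset(int i) {
        return i % 8;
    }

    public static void main(String[] args) {
        for (int i : new int[] {1, 9, 20}) {
            System.out.println(i + " -> a[" + byteIndex(i) + "], bit " + bitOffset(i));
        }
    }
}
```

For instance, 20 lands in a[2] at bit 4, because 20 = 2 * 8 + 4.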

With these discussions, let's look at a concrete implementation of bitmap.

An implementation of bitmap

From the discussion above, a bitmap needs to support the following operations:

1. Set (set): set a bit to 1.
2. Clear (clear): set a bit to 0.
3. Get (get): read a bit and see whether it is 1 or 0.
4. Size (size): the number of bits the container can hold, that is, the length of the container.
5. Count (count): the number of bits that are set to 1.

We will analyze each of them in turn.

First, we define a byte array to hold the data. In addition, we need fields to hold the total number of bits and the number of bits that are set. Therefore, we have the following definitions:

    private byte[] bits;
    private int size;
    private int count = -1;

To construct a BitVector we need to specify its length. One of its constructors can be written as follows:

    public BitVector(int n) {
        size = n;
        bits = new byte[(size >> 3) + 1];
    }

Here the parameter n is the number of bits, which equals the number of distinct values that can be recorded. Since we store them in bytes, the number of bytes needed to hold that many bits is n / 8 + 1. This length is computed with a shift as (size >> 3) + 1; shifting right by 3 bits is equivalent to dividing by 8.
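
One detail worth noticing is that (n >> 3) + 1 always allocates enough bytes, but it allocates one spare byte when n is an exact multiple of 8. The sketch below compares it with a tighter rounding-up formula, (n + 7) >> 3. That second formula is an alternative shown here only for comparison, not something the code above uses, and the names ByteLength, articleLength and tightLength are hypothetical.

```java
// Compares two ways of computing how many bytes are needed for n bits.
// ByteLength and its method names are made up for this illustration.
class ByteLength {
    // The formula used by the constructor discussed above.
    static int articleLength(int n) {
        return (n >> 3) + 1;
    }

    // Rounds n / 8 up instead of always adding one byte.
    // Assumes n + 7 does not overflow, i.e. n <= Integer.MAX_VALUE - 7.
    static int tightLength(int n) {
        return (n + 7) >> 3;
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 8, 10, 16}) {
            System.out.println(n + " bits: " + articleLength(n) + " vs " + tightLength(n) + " bytes");
        }
    }
}
```

The extra byte is harmless in practice; the simpler formula just trades a byte of memory for not having to think about rounding.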

Set

As mentioned earlier, to set a bit we need to find the byte where it resides and then set the corresponding bit of that byte. n / 8 gives the index of the byte, and n % 8 gives the bit within that byte. This part of the code is implemented as follows:

    public final void set(int bit) {
        bits[bit >> 3] |= 1 << (bit & 7);
        count = -1;
    }

This matches the earlier discussion; it just uses shifts and masks to achieve the same effect. bit >> 3 is equivalent to bit / 8, and bit & 7 is equivalent to bit % 8. Why is bit & 7 equivalent? The same trick came up in an earlier article analyzing the HashMap implementation. A byte has 8 bits, and 8 in binary is 1000; 7, which is one less, is 0111. When bit is ANDed with 7, every bit above the lowest three is cleared and only the lowest 3 bits are kept. The value of those 3 bits ranges from 0 to 7, which is exactly the result of taking the number modulo 8 (for non-negative values of bit).
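
The equivalence is easy to check by brute force. The following sketch (the class name MaskVersusModulo is invented for this example) verifies it for a range of non-negative indexes and shows why the caveat about negative numbers matters: Java's / and % round toward zero, while >> and & work on the two's-complement bit pattern.

```java
// Checks that shift/mask and divide/modulo agree for non-negative indexes.
// MaskVersusModulo is a hypothetical name used only for this illustration.
class MaskVersusModulo {
    static boolean agree(int bit) {
        return (bit >> 3) == bit / 8 && (bit & 7) == bit % 8;
    }

    public static void main(String[] args) {
        boolean allAgree = true;
        for (int bit = 0; bit <= 1000; bit++) {
            allAgree &= agree(bit);
        }
        System.out.println("agree for 0..1000: " + allAgree);

        // For negative input the two forms differ.
        System.out.println("-1 % 8 = " + (-1 % 8) + ", -1 & 7 = " + (-1 & 7));
    }
}
```

Since bit indexes in a bitmap are never negative, the shift and mask forms are safe replacements here.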

Clear

This is the opposite of the set method: here we need to set a specific bit to 0.

    public final void clear(int bit) {
        bits[bit >> 3] &= ~(1 << (bit & 7));
        count = -1;
    }

Get

The get method determines whether a given bit is set to 1. We AND the byte with a mask that has a 1 only in the position corresponding to bit; if the result is not 0, the bit is set.

    public final boolean get(int bit) {
        return (bits[bit >> 3] & (1 << (bit & 7))) != 0;
    }
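
To see the three operations working together, here is a minimal, self-contained sketch. TinyBitVector is a made-up class that only mirrors the set, clear and get logic shown above; it leaves out the size and count bookkeeping and is not the Lucene class itself.

```java
// A stripped-down bit vector that mirrors the set/clear/get logic above.
// TinyBitVector is a hypothetical class written for this illustration.
class TinyBitVector {
    private final byte[] bits;

    TinyBitVector(int n) {
        bits = new byte[(n >> 3) + 1];
    }

    void set(int bit) {
        bits[bit >> 3] |= 1 << (bit & 7);      // turn the bit on
    }

    void clear(int bit) {
        bits[bit >> 3] &= ~(1 << (bit & 7));   // turn the bit off
    }

    boolean get(int bit) {
        return (bits[bit >> 3] & (1 << (bit & 7))) != 0;
    }

    public static void main(String[] args) {
        TinyBitVector v = new TinyBitVector(16);
        v.set(9);
        System.out.println(v.get(9));   // the bit for 9 is now on
        v.clear(9);
        System.out.println(v.get(9));   // and off again
    }
}
```

Note that setting bit 7 of a byte makes the byte negative in Java, but get still works because the mask only inspects that single bit.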

Count

The implementation of the count method is more subtle. The obvious approach to counting all the bits that are set to 1 is to iterate over every byte and count the 1 bits in each. A natural way to do that is to AND the byte with 1 shifted to each position in turn: if the result is 0 the bit is not set, otherwise it is. This works, but it takes 8 tests per byte, so the amount of computation is 8 times the number of bytes. For big data, optimizing this is worthwhile. Below is a more efficient implementation that trades space for time:
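
For reference, the straightforward approach described above might look like the following sketch. The class and method names are invented for this example; it is shown only as a baseline for the table-driven version.

```java
// Baseline bit counting: tests each of the 8 bits of every byte.
// NaiveBitCount is a hypothetical name used only for this illustration.
class NaiveBitCount {
    // Counts the 1 bits in a single byte with 8 mask tests.
    static int countBits(byte b) {
        int n = 0;
        for (int k = 0; k < 8; k++) {
            if ((b & (1 << k)) != 0) {
                n++;
            }
        }
        return n;
    }

    // Counts the 1 bits in a whole array, 8 tests per byte.
    static int countBits(byte[] bits) {
        int total = 0;
        for (byte b : bits) {
            total += countBits(b);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countBits(new byte[] {1, 3, (byte) 0x80}));
    }
}
```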

    public final int count() {
        // if the vector has been modified
        if (count == -1) {
            int c = 0;
            int end = bits.length;
            for (int i = 0; i < end; i++)
                c += BYTE_COUNTS[bits[i] & 0xFF];  // sum bits per byte
            count = c;
        }
        return count;
    }

    private static final byte[] BYTE_COUNTS = {  // table of bits/byte
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
    };

This code creates an array named BYTE_COUNTS, which records, for each possible byte value, how many 1 bits it contains. The expression bits[i] & 0xFF yields an 8-bit value in the range 0 to 255, so the problem reduces to knowing how many 1 bits the binary representation of that value has. For example, 0 has zero 1 bits, 1 has one, 2 has one, 3 has two, and so on. A byte can take only 256 different values, so if the number of 1 bits for each of these 256 values is stored in advance, the count for any byte can be read directly from the table.
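
The 256 entries do not have to be typed in by hand. As a sketch of how such a table could be generated (this is an alternative for illustration, not how the code above builds it; ByteCountTable and its members are hypothetical names), each entry can be derived from a smaller one: the bit count of v is the bit count of v >> 1 plus the lowest bit of v.

```java
// Builds a 256-entry bit-count table and uses it to count bits in a byte[].
// ByteCountTable is a hypothetical class written for this illustration.
class ByteCountTable {
    static final byte[] TABLE = build();

    static byte[] build() {
        byte[] t = new byte[256];
        // t[0] is 0; every other entry reuses the entry for v >> 1.
        for (int v = 1; v < 256; v++) {
            t[v] = (byte) (t[v >> 1] + (v & 1));
        }
        return t;
    }

    // One table lookup per byte instead of 8 mask tests.
    static int count(byte[] bits) {
        int c = 0;
        for (byte b : bits) {
            c += TABLE[b & 0xFF];   // & 0xFF maps the signed byte to 0..255
        }
        return c;
    }

    public static void main(String[] args) {
        System.out.println(count(new byte[] {(byte) 0xFF, 1, 0}));
    }
}
```

The generated table can be cross-checked against Integer.bitCount, which returns the same values for 0 through 255.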

Comparison with BitSet

The bitmap implementation discussed above is actually a snippet of code from the open source project Lucene. It uses a byte array to store its data internally, and the set, clear and get operations are implemented with shifts and other bitwise operations to be as efficient as possible. The Java class library contains a similar implementation: BitSet.

The internal implementation of BitSet differs slightly from that of BitVector: it uses a long[] array to hold the bits. As a result, the set and clear operations locate a bit differently. For example, where set previously divided the bit number by 8, it now divides by 64, which is equivalent to a right shift by 6 bits (>> 6).

In addition, BitSet does not cache the number of bits set to 1 in a field the way BitVector does. A 64-bit word cannot be used directly as an index into a lookup table like the 256-entry one above, so that trick does not carry over as is. Instead, BitSet provides the cardinality() method, which recomputes the count on every call by summing Long.bitCount over its words.
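
As a small usage sketch, java.util.BitSet can be applied to the deduplication and counting problem posed at the beginning of the article, here on a tiny in-memory array instead of a large file. The class name BitSetDedup and the sample data are made up for this example; set, get, cardinality and nextSetBit are standard BitSet methods.

```java
import java.util.BitSet;

// Deduplicates non-negative ints with java.util.BitSet and counts them.
// BitSetDedup and the sample data are hypothetical, for illustration only.
class BitSetDedup {
    // Marks every value that occurs; duplicates just set the same bit again.
    static BitSet distinct(int[] values) {
        BitSet seen = new BitSet();
        for (int v : values) {
            seen.set(v);   // throws IndexOutOfBoundsException for negative v
        }
        return seen;
    }

    public static void main(String[] args) {
        int[] data = {5, 3, 5, 9, 3, 0};
        BitSet seen = distinct(data);

        System.out.println("distinct count: " + seen.cardinality());

        // Walk the set bits in ascending order to list the distinct values.
        for (int v = seen.nextSetBit(0); v >= 0; v = seen.nextSetBit(v + 1)) {
            System.out.println(v);
        }
    }
}
```

A side effect of this approach is that the distinct values come out sorted, since the bits are visited in index order.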

There is one interesting spot in the internal implementation of BitSet. Let's first look at this piece of code:

    public void set(int bitIndex) {
        if (bitIndex < 0)
            throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);

        int wordIndex = wordIndex(bitIndex);
        expandTo(wordIndex);

        words[wordIndex] |= (1L << bitIndex); // Restores invariants

        checkInvariants();
    }

    private static int wordIndex(int bitIndex) {
        return bitIndex >> ADDRESS_BITS_PER_WORD;
    }

This is the corresponding set method in Java. Based on our understanding so far, it should find the corresponding long element and then set to 1 the bit whose position is the index modulo 64. However, the line that actually sets the bit is: words[wordIndex] |= (1L << bitIndex); It uses a shift, but takes no modulo 64. Why? Won't this go wrong? Intuitively, when a value is shifted left beyond the width of its type, we expect the bits that go past the end to be discarded. If that were the case here, wouldn't a shift by 64 or more simply produce 0? Let's look at this more closely.

An interesting detail

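The answer to this question is not complicated; reading the definition of the shift operators carefully is enough. For operators such as << and >>, Java does not use the full right-hand operand as the shift distance. When the left-hand operand is an int, only the lowest 5 bits of the distance are used (as if it were distance & 31); when it is a long, only the lowest 6 bits are used (distance & 63). So a distance equal to or larger than the width of the type does not produce 0: the distance wraps around. Note that only the distance wraps; bits shifted out of the value itself are still discarded, so this is not a rotation. For example, let's look at a very simple piece of code: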

    class Test {
        public static void main(String[] args) {
            for (int i = 0; i < 100; i++)
                System.out.println(1 << i);
        }
    }

If we run this code, we see that once i reaches 32 the output starts over from 1 again. Part of the output is shown below:

    1  2  4  8  ...  2048  4096  8192  16384  32768  65536  131072  262144  524288
    1048576  2097152  4194304  8388608  16777216  33554432  67108864  134217728
    268435456  536870912  1073741824  -2147483648
    1  2  4  8  ...

Now we understand why the BitSet code uses a plain left shift. Because the shift distance for a long is reduced modulo 64, 1L << bitIndex gives the same result as 1L << (bitIndex % 64) for a non-negative bitIndex. This is a neat trick, but personally I find it unintuitive; writing the modulo or mask explicitly would read better.
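
The behavior can be confirmed with a short experiment. In the sketch below, ShiftDistance and its methods are names invented for this example; Integer.rotateLeft is the standard library method for a true rotation and is included to show that << behaves differently from it.

```java
// Demonstrates that Java masks the shift distance but does not rotate bits.
// ShiftDistance is a hypothetical class written for this illustration.
class ShiftDistance {
    // The same expression BitSet uses to build the mask for a bit.
    static long bitMask(int bitIndex) {
        return 1L << bitIndex;
    }

    static int shiftLeft(int value, int distance) {
        return value << distance;
    }

    public static void main(String[] args) {
        // For long, only the low 6 bits of the distance are used.
        System.out.println(bitMask(70) == bitMask(70 & 63));

        // For int, only the low 5 bits are used, so shifting by 32 shifts by 0.
        System.out.println(shiftLeft(1, 32));

        // Bits pushed past the top are discarded by <<, unlike a rotation.
        System.out.println(shiftLeft(3, 31));
        System.out.println(Integer.rotateLeft(3, 31));
    }
}
```

The last two lines print different values: the shift loses the upper bit of 3, while the rotation carries it around to the lowest position.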

Summary

A bitmap represents the presence or absence of data by making full use of every bit in an array: if a bit is set to 1 the corresponding value exists, otherwise it does not. Because every bit carries information, space utilization is much higher than when each array element represents a single value. For example, an int array of the same length uses one int element per value, whereas using every bit of an int lets one element represent 32 values. As a result, a bitmap can handle data mapping and filtering problems over a considerably larger range. Of course, a bitmap is still limited by the computer's own data representation; beyond a certain range we need to combine it with data partitioning and other techniques. Finally, studying the detailed implementation of these data structures reveals many details that deepen our understanding, often ones we usually overlook.

Resources

http://alvinalexander.com/java/jwarehouse/lucene-1.3-final/src/java/org/apache/lucene/util/BitVector.java.shtml

http://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html
