Using a bitmap to sort and deduplicate big data

Source: Internet
Author: User

Problem: Given M integers (for example, 1 billion values of type int), of which N are duplicates, read them into memory and remove the duplicates.

Problem analysis: The obvious approach is to allocate an array of M ints in memory, read the values in one by one, compare them against each other, and finally drop the duplicates. This is, of course, feasible for small-scale data.

Now consider the big data case: for example, deduplicating 1 billion values of type int in Java.

An int in Java occupies 4 bytes in memory, so 1 billion ints need 4 × 10^9 bytes ≈ 4 GB of contiguous memory. Take a 32-bit operating system as an example: the maximum addressable memory is 4 GB, and the memory actually available to a program is less than that. So the approach above does not work for big data.

A change of approach: Since we cannot allocate an int array large enough to hold all the data, we can use a smaller data type to record which int values have been read. Because everything a computer processes is ultimately a sequence of 0 and 1 bits, can we use a single bit to represent one int value?

Deriving the bit mapping: Use a smaller data type to stand for a larger one. As mentioned above, we can let 1 bit correspond to one int value. If that int value is present in the data, the corresponding bit is set to 1; otherwise it stays 0 (like a boolean). The int range in Java is -2^31 through 2^31-1, so there are 2^32 possible values and the bitmap needs 2^32 bits. That is 2^32 bits = 2^29 bytes = 512 MB of memory. This satisfies the requirements, although the memory consumption is still not small.
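The two memory figures can be checked with a few lines of arithmetic. This is only an illustrative sketch; the class and method names (`MemoryEstimate`, `intArrayBytes`, `bitmapBytes`) are hypothetical and not part of the original article.

```java
// Hypothetical helper, for illustration only: compares the memory needed by
// a plain int[] with the memory needed by a bitmap over the full int range.
final class MemoryEstimate {
    // Bytes needed to store `count` ints at 4 bytes each.
    static long intArrayBytes(long count) {
        return count * Integer.BYTES;
    }

    // Bytes needed for one bit per possible int value (2^32 bits).
    static long bitmapBytes() {
        return (1L << 32) / 8;
    }

    public static void main(String[] args) {
        System.out.println(intArrayBytes(1_000_000_000L)); // 4000000000 bytes, about 4 GB
        System.out.println(bitmapBytes());                 // 536870912 bytes = 512 MB
    }
}
```

Note that the bitmap size depends only on the range of possible values, not on how many values are read.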

Solution: First define the mapping from int values to bit positions. The mapping can be customized, as long as the resulting array index never goes out of bounds.


A bit[] array like the one described above does not exist in Java, so it has to be stored in one of Java's primitive types. A byte[] is the natural choice.

Converting to a byte[] array:

Define the mapping so that each bit corresponds to one int value: here the minimum int value maps to the lowest bit index and the maximum int value maps to the highest bit index. The bit index therefore differs from the int value by 2^31. You can define other mappings as well; just take care that the array index never goes out of bounds. Because the largest bit index is 2^32 - 1, which does not fit in an int, it is stored in a long.

long bitindex = num + (1L << 31);

Compute the index into the byte[] array. Because bitindex as defined above is never negative, plain division is enough and no special handling of the sign bit is needed.

int index = (int) (bitindex / 8);

Compute the position of the bit inside the byte at that index.

int innerindex = (int) (bitindex % 8);

Use a bitwise OR to set that bit in the byte at position index, leaving the other bits of the byte unchanged.

dataBytes[index] = (byte) (dataBytes[index] | (1 << innerindex));
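The steps above can be put together in a small class. The following is a minimal sketch rather than the author's implementation: the class name `IntBitmap`, its methods, and the configurable `[min, max]` range are assumptions added so that the example can run with a small array. Passing `Integer.MIN_VALUE` and `Integer.MAX_VALUE` reproduces the offset of 2^31 and the 512 MB array described in the text.

```java
// Hypothetical sketch: a bitmap over the int range [min, max], one bit per value.
final class IntBitmap {
    private final int min;
    private final int max;
    private final byte[] dataBytes;

    IntBitmap(int min, int max) {
        if (min > max) {
            throw new IllegalArgumentException("min must not exceed max");
        }
        this.min = min;
        this.max = max;
        long bits = (long) max - min + 1;            // up to 2^32, so use long
        this.dataBytes = new byte[(int) ((bits + 7) / 8)];
    }

    // Sets the bit for num. Returns true if num had not been seen before.
    boolean add(int num) {
        checkRange(num);
        long bitindex = (long) num - min;            // non-negative bit index
        int index = (int) (bitindex / 8);            // which byte
        int innerindex = (int) (bitindex % 8);       // which bit inside that byte
        int mask = 1 << innerindex;
        boolean seen = (dataBytes[index] & mask) != 0;
        dataBytes[index] = (byte) (dataBytes[index] | mask);
        return !seen;
    }

    // Returns true if the bit for num is set.
    boolean contains(int num) {
        checkRange(num);
        long bitindex = (long) num - min;
        return (dataBytes[(int) (bitindex / 8)] & (1 << (int) (bitindex % 8))) != 0;
    }

    private void checkRange(int num) {
        if (num < min || num > max) {
            throw new IllegalArgumentException("value out of range: " + num);
        }
    }
}
```

For example, with `new IntBitmap(-8, 7)` the first `add(-8)` reports a new value and a second `add(-8)` reports a duplicate, which is exactly the information deduplication needs.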

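This solves the problem of reading in and deduplicating the whole data set: each distinct value sets exactly one bit, and scanning the bits in ascending index order yields the distinct values in sorted order.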
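Reading the result back out is a single pass over the bits. The sketch below is an illustration under stated assumptions, not code from the original article: the class `BitmapSortDistinct`, the method `sortDistinct`, and the `[min, max]` parameters are hypothetical, and every input value is assumed to lie within that range. Because bit indexes grow with the int value, walking the bits from low to high produces the distinct values in ascending order.

```java
// Hypothetical sketch: deduplicate and sort ints by marking bits, then scanning them.
final class BitmapSortDistinct {
    // Assumes every value in input lies in [min, max]; other values are not handled.
    static int[] sortDistinct(int[] input, int min, int max) {
        long bits = (long) max - min + 1;
        byte[] dataBytes = new byte[(int) ((bits + 7) / 8)];

        // Pass 1: set one bit per value and count the distinct values.
        int distinct = 0;
        for (int num : input) {
            long bitindex = (long) num - min;
            int index = (int) (bitindex / 8);
            int mask = 1 << (int) (bitindex % 8);
            if ((dataBytes[index] & mask) == 0) {
                dataBytes[index] |= mask;
                distinct++;
            }
        }

        // Pass 2: scan bits in ascending order and map each set bit back to its int.
        int[] result = new int[distinct];
        int out = 0;
        for (long bitindex = 0; bitindex < bits; bitindex++) {
            int mask = 1 << (int) (bitindex % 8);
            if ((dataBytes[(int) (bitindex / 8)] & mask) != 0) {
                result[out++] = (int) (bitindex + min);
            }
        }
        return result;
    }
}
```

With the input `{5, -3, 5, 0, -3, 7}` and the range `[-8, 7]`, the method returns `{-3, 0, 5, 7}`. The scan visits every bit in the range, so for the full int range it always touches all 2^32 bits regardless of how many values were read.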
