[Data Structure] Bit-map: space compression, fast sorting, and deduplication

Source: Internet
Author: User

A bit-map is an ingenious data storage structure. It uses a single bit to mark the state of an element (for example, whether it is present), with the element's value itself serving as the key, that is, the index of the bit. Because data is stored at the granularity of a bit, storage space can be reduced greatly. Bit-maps have a wide range of practical applications, such as fast sorting, deduplication, and fast lookup. This article introduces the bit-map and its extension, the Bloom filter, through several application examples.

1. Basic idea of Bit-map

On a 32-bit machine, an integer such as int a = 1 occupies 32 bits in memory, which is convenient for the computer to operate on. In some scenarios, however, this is a huge waste, because those same 32 bits could instead record the presence of each of the decimal numbers 0-31, one bit per number. This is the basic idea of the bit-map. Bit-map algorithms use this idea to sort, query, and deduplicate large amounts of data.
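As a rough illustration of this idea, the sketch below uses one integer as a 32-slot presence table for the numbers 0-31. The helper names set_bit and has_bit are made up for this example and do not come from the original article.

```python
def set_bit(word: int, n: int) -> int:
    """Return word with bit n set to 1 (n is assumed to be in 0..31)."""
    return word | (1 << n)


def has_bit(word: int, n: int) -> bool:
    """Return True if bit n of word is 1."""
    return (word >> n) & 1 == 1


# One 32-bit word records which of the numbers 0..31 are present.
word = 0
for n in (1, 5, 31):
    word = set_bit(word, n)

print(has_bit(word, 5))  # True
print(has_bit(word, 6))  # False
```

Storing the same three numbers as ordinary 32-bit integers would take 96 bits; here they share a single 32-bit word.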

2. Bit-map application: fast sorting

Suppose we want to sort 5 elements (4, 7, 2, 5, 3), all within the range 0-7 (assuming here that no element is duplicated). We can use a bit-map to do the sorting. To represent 8 numbers we need only 8 bits (1 byte), so we first allocate 1 byte and set all of its bits to 0: 0 0 0 0 0 0 0 0 (positions 0 through 7, left to right).

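Then we traverse the 5 elements. The first element is 4, so we set the bit corresponding to 4 to 1. In C-like notation, with p taken to be a byte pointer to the start of the bit area and i the element, this can be done as p[i/8] |= (0x01 << (i%8)). Of course, this operation involves a bit-ordering convention (big-endian versus little-endian; big-endian is assumed by default here). Because positions are zero-based, it is the fifth position that is set to 1: 0 0 0 0 1 0 0 0 (positions 0 through 7, left to right).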

Next, the second element, 7, is processed and the eighth position is set to 1. The third element is handled the same way, and so on until all elements have been processed and their corresponding bits set to 1. The bits in memory then look like this: 0 0 1 1 1 1 0 1 (positions 0 through 7, left to right).

Finally, we traverse the bit area and output the position of every bit that is set to 1 (2, 3, 4, 5, 7), which yields the elements in sorted order. Setting the bits takes O(n) time for n elements; the final scan is proportional to the size of the value range.
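The sorting steps above can be sketched as follows. This is a minimal illustration, assuming non-negative, non-duplicated integers no larger than max_value; the function name bitmap_sort and its parameters are chosen for this example only.

```python
def bitmap_sort(values, max_value):
    """Sort distinct integers in 0..max_value using one bit per possible value."""
    # One bit per possible value, rounded up to whole bytes, all initially 0.
    bits = bytearray(max_value // 8 + 1)

    # Pass 1: set the bit that corresponds to each element.
    for v in values:
        bits[v // 8] |= 0x01 << (v % 8)

    # Pass 2: scan the bit area in order and emit every position whose bit is 1.
    return [i for i in range(max_value + 1) if bits[i // 8] & (0x01 << (i % 8))]


print(bitmap_sort([4, 7, 2, 5, 3], 7))  # [2, 3, 4, 5, 7]
```

Note that a duplicated input value would simply set the same bit twice, so it would appear only once in the output.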

The advantages and disadvantages of bit-map for sorting are equally obvious.

  Advantages:

  1. High computational efficiency: no comparisons or element moves are needed;

  2. Low memory usage: for example, with n = 10,000,000, only n/8 = 1,250,000 bytes (about 1.25 MB) are needed.

  Disadvantages:

  All values must be distinct; that is, duplicate data cannot be sorted or searched with this scheme.

3. Bit-map application: fast deduplication

This problem often comes up in interviews. For example: given 250 million integers, find the number of non-repeating integers (those that appear exactly once), when memory is not large enough to hold all 250 million integers.

First, the condition "memory is not large enough to hold these 250 million integers" quickly suggests a bit-map. The key question is how to design the bit-map so that it represents the state of each number. This is actually simple: a number has only three possible states, namely absent, present exactly once, and present more than once. So 2 bits are enough to store the state of one number. Suppose we encode absent as 00, present once as 01, and present two or more times as 11. The storage needed then depends on the range of values: if the values fall within a range of about 250 million, it is 2 x 250,000,000 bits = 62.5 MB, that is, dozens of megabytes; covering every possible 32-bit integer would take 2^32 x 2 bits = 1 GB.

The next task is to traverse the 250 million numbers once. If a number's status bits are 00, change them to 01; if they are 01, change them to 11; if they are 11, leave them unchanged.

Finally, we count the entries whose status bits are 01, which gives the number of non-repeating integers. The time complexity is O(n).
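A minimal sketch of this 2-bit scheme is shown below, assuming non-negative integers no larger than max_value. The function name count_appearing_once and the packing of four 2-bit states per byte are choices made for this example, not something specified in the original article.

```python
def count_appearing_once(values, max_value):
    """Count the integers in 0..max_value that occur exactly once in values."""
    # Two status bits per possible value, so four values share one byte.
    states = bytearray((max_value + 1 + 3) // 4)

    for v in values:
        byte, shift = v // 4, (v % 4) * 2
        state = (states[byte] >> shift) & 0b11
        if state == 0b00:
            states[byte] |= 0b01 << shift  # absent -> seen once
        elif state == 0b01:
            states[byte] |= 0b11 << shift  # seen once -> seen repeatedly
        # state 0b11 stays unchanged

    # Count the entries whose status bits are exactly 01.
    count = 0
    for v in range(max_value + 1):
        if (states[v // 4] >> ((v % 4) * 2)) & 0b11 == 0b01:
            count += 1
    return count


print(count_appearing_once([1, 2, 2, 3, 3, 3, 7], 7))  # 2 (the values 1 and 7)
```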

4. Bit-map application: fast query

Similarly, we can use a bit-map for fast queries. In this case each number needs only one bit: 0 means the number is absent and 1 means it is present. Suppose the problem above is changed to: how can we quickly determine whether a given number exists among those 250 million numbers?

As before, we first iterate over all the numbers and set the corresponding status bit to 1. The query comes after this traversal. Because the bit-map is stored contiguously (as an integer array in which each element holds 32 bits), we are in effect using a bucketing scheme. Each array element stores 32 status bits, so to query a number we divide it by 32 to locate the array element (the bucket), and then take the remainder (% 32) to locate the status bit within that element. If the bit is 1, the number is present; otherwise, it is not.
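The bucket lookup described above might look like the following sketch. The class name BitMap and its methods are hypothetical, and a Python list of integers stands in for the 32-bit integer array; only the low 32 bits of each list element are ever used.

```python
class BitMap:
    """One status bit per number, stored in 32-bit buckets (words)."""

    WORD_BITS = 32

    def __init__(self, max_value):
        # Each list element plays the role of one 32-bit array element.
        self.max_value = max_value
        self.words = [0] * (max_value // self.WORD_BITS + 1)

    def add(self, n):
        # n // 32 selects the bucket, n % 32 selects the bit inside it.
        self.words[n // self.WORD_BITS] |= 1 << (n % self.WORD_BITS)

    def contains(self, n):
        if n < 0 or n > self.max_value:
            return False
        return (self.words[n // self.WORD_BITS] >> (n % self.WORD_BITS)) & 1 == 1


bm = BitMap(100)
for n in (0, 31, 32, 100):
    bm.add(n)

print(bm.contains(32))  # True
print(bm.contains(33))  # False
```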

5. Bit-map extension: Bloom filter

Bloom filter: a space-efficient probabilistic data structure that uses a bit array to represent a set very compactly and to test whether an element belongs to that set.

  This efficiency comes at a cost: when testing whether an element belongs to the set, a Bloom filter may mistakenly report an element that is not in the set as belonging to it (a false positive). A Bloom filter is therefore unsuitable for applications that require zero errors. In applications that can tolerate a low error rate, it trades a very small number of errors for a large saving in storage space.

Set representation and element queries

Let's look at how a Bloom filter uses a bit array to represent a set. In the initial state, the Bloom filter is an array of m bits, each of which is set to 0.

To represent a set S = {x1, x2, ..., xn} of n elements, the Bloom filter uses k independent hash functions, each of which maps every element of the set into the range {1, ..., m}. For any element x, the position hi(x) given by the i-th hash function is set to 1 (1 ≤ i ≤ k). Note that if a position is set to 1 more than once, only the first time has any effect; later times change nothing. In the original article's figure (not reproduced here), k = 3, and two of the hash functions select the same position (the fifth bit from the left, i.e. the second "1").

To determine whether y belongs to the set, we apply the k hash functions to y. If all positions hi(y) are 1 (1 ≤ i ≤ k), y is considered an element of the set; otherwise y is considered not to be an element of the set. In the original article's figure (not reproduced here), y1 is not an element of the set, because one of its hash positions points to a "0" bit; y2 either belongs to the set or is a false positive.
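The insertion and query procedure can be sketched as below. This is only an illustration under stated assumptions: the class name BloomFilter is hypothetical, positions run from 0 to m-1 rather than 1 to m, and the k hash functions are simulated by prefixing the input with an index before hashing with SHA-256, which is one common way to derive several hash values and is not prescribed by the original article.

```python
import hashlib


class BloomFilter:
    """Bit array of m bits with k hash functions (assumed construction)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits, all initially 0

    def _positions(self, item):
        # Simulate k hash functions by salting SHA-256 with the index i.
        data = str(item).encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Set every position h_i(item) to 1; setting a bit twice has no extra effect.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "in the set, or a false positive"; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter(m=1024, k=3)
bf.add("x1")
bf.add("x2")
print(bf.might_contain("x1"))  # True
```

An added element is always reported as present, since all of its bits were set. An element that was never added is usually rejected, but can occasionally pass if all of its positions happen to have been set by other elements.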

6. Summary

Using the idea of the bit-map, we can compress storage space and quickly sort, deduplicate, and query numbers. The Bloom filter is an extension of the bit-map idea: when a low error rate is acceptable, it compresses space greatly. It is a data structure that trades error rate for space.
