The Power of Algorithms: Applying Bit Operations to Sorting and Searching

Source: Internet
Author: User
Tags: import, database

Introduction:
Question: A file contains 900 million distinct 9-digit integers. Sort the file.

General solution:
1. Load the data into memory
2. Sort the data (for example, with insertion sort or quicksort)
3. Write the sorted data back to a file

Difficulties:
An integer takes 4 bytes.
Even in a plain array, 900,000,000 * 4 bytes = 3.6 billion bytes (about 3.4 GB) of memory is required.
On a 32-bit system it is very difficult to use more than 2 GB of memory, and most machines do not have that much physical memory anyway.
Loading all the data into memory is therefore unrealistic.

Other solutions:
1. Import into a database
2. Segmented sorting
3. Bit operations

Solution 1: Database sorting
Import the text file into a database, let the database sort the data through an index, and then export the sorted data to a file.

Advantage: simple to do
Disadvantage: slow, and a database server is required.

Solution 2: Segmented sorting
Method:
Fix the amount of memory to use, for example 200 MB. 200 MB holds 52,428,800 4-byte integers, so it can hold about 50 million records. On each pass, read through the file, extract only the records that fall into one range of 50 million values, and sort them. Covering the whole 9-digit range takes 20 such ranges, so a total of 20 sort operations and 20 reads of the file are required.

Disadvantages:
Complicated coding and slow speed (at least 20 passes over the file)

Key steps:
Split the 9-digit value range into 20 segments of 50 million values each, so that each segment holds at most 50 million of the 900 million records
On each pass, collect the records in one segment: 0 ~ 49,999,999, then 50,000,000 ~ 99,999,999, and so on
Sort each segment and append the result to the output file
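
The key steps above can be sketched roughly as follows. This is only a minimal illustration, not the author's code: the function name segmented_sort is made up for this example, and the input vector stands in for the file on disk, so each loop over it models one full read of the file.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: 'input' stands in for the unsorted file.
// Values are expected to lie in 0..max_value.
std::vector<std::uint32_t> segmented_sort(const std::vector<std::uint32_t>& input,
                                          std::uint32_t max_value,
                                          std::uint32_t segment_size) {
    std::vector<std::uint32_t> output;
    if (segment_size == 0) return output;  // avoid an endless loop
    output.reserve(input.size());

    // 64-bit bounds so that low + segment_size cannot overflow.
    for (std::uint64_t low = 0; low <= max_value; low += segment_size) {
        const std::uint64_t high = low + segment_size;  // exclusive upper bound

        // One pass over the "file": keep only values in [low, high).
        std::vector<std::uint32_t> segment;
        for (std::uint32_t v : input) {
            if (v >= low && v < high) segment.push_back(v);
        }

        // Sort this segment in memory and append it to the result.
        std::sort(segment.begin(), segment.end());
        output.insert(output.end(), segment.begin(), segment.end());
    }
    return output;
}
```

With max_value = 999999999 and segment_size = 50000000 the outer loop runs 20 times, which matches the 20 reads and 20 sorts counted above.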

Solution 3: Bit operations
Consider the following observations:
The largest 9-digit integer is 999999999.
The 900 million values contain no duplicates.
Can we build an array with subscripts 0 ~ 999999999 (1 billion elements)?
The array subscript stands for a value: an element of 0 means that the number is absent from the file, and 1 means that it is present.
Since each element only needs to hold 0 or 1, a single bit per element is enough.

Declare a bit array of 1 billion bits, enough to cover every 9-digit integer. This takes 1 billion / 8 = 125 million bytes, about 120 MB of memory.
Initialize all of that memory to 0.
Read the file and record each value in memory. For example, on reading the value 341245909, find bit number 341245909 in the array and set it to 1.
Traverse the entire bit array and write the subscript of every bit that is 1 to the output file. The subscripts come out in ascending order, so the output is sorted.
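
The four steps above can be put together in a short sketch. It is an illustration under assumptions rather than the original program: bitmap_sort is a made-up name, the input and output vectors stand in for the input and output files, and the input is assumed to contain no duplicates.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: sorts distinct values in 0..max_value with a bit array.
std::vector<std::uint32_t> bitmap_sort(const std::vector<std::uint32_t>& input,
                                       std::uint32_t max_value) {
    // Steps 1 and 2: one bit per possible value, all initialized to 0.
    std::vector<unsigned char> bits(static_cast<std::size_t>(max_value) / 8 + 1, 0);

    // Step 3: value v lives in byte v / 8, at bit position v % 8.
    for (std::uint32_t v : input) {
        if (v > max_value) continue;  // ignore out-of-range values
        bits[v / 8] |= static_cast<unsigned char>(1u << (v % 8));
    }

    // Step 4: walk the subscripts in ascending order and emit the set ones.
    std::vector<std::uint32_t> output;
    for (std::uint64_t v = 0; v <= max_value; ++v) {
        if (bits[v / 8] & (1u << (v % 8))) {
            output.push_back(static_cast<std::uint32_t>(v));
        }
    }
    return output;
}
```

For the example value in the text, 341245909 / 8 = 42655738 and 341245909 % 8 = 5, so the value is recorded in bit 5 of byte 42655738.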

Key code
Check whether bit second (0 ~ 7) of the char first is 1.

bool comparebit(unsigned char first, int second)
{
    // Masks for bits 0 through 7 of a byte.
    static const int mark_buf[] = {0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80};
    if (second < 0 || second > 7)
        return false;

    return (first & mark_buf[second]) == mark_buf[second];
}

Set bit source (0 ~ 7) of the char pointed to by desc to 1.

bool writetobit(unsigned char *desc, int source)
{
    // Masks for bits 0 through 7 of a byte.
    static const int mark_buf[] = {0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80};

    if (source < 0 || source > 7)
        return false;

    desc[0] |= mark_buf[source];

    return true;
}

Case
In one project, we needed to remove duplicate records from 200 million mobile phone numbers (the same approach also works for filtering numbers against a blacklist)

The difficulty lies in handling 200 million phone numbers. Storing them directly in a hash table is not realistic. Even after optimization, with one unsigned int per record, the keys alone take 200 million * 4 = 800 million bytes, before any hash-table overhead, which is more than a process on a 32-bit system can comfortably allocate.

Solution:
Convert each 11-digit number string into an unsigned int (this is entirely possible because a phone number consists of a three-digit prefix followed by eight digits; the last eight digits need the range 0 ~ 99,999,999, and the prefix can be mapped to a small number, for which 0 ~ 100 is sufficient)
For simplicity, assume the converted numbers may fall anywhere in 0 ~ 4G. We therefore allocate 4G / 32 = 128M unsigned ints (4G bits, that is 512 MB) as the bit array.
Map the 200 million numbers to unsigned int values and record them in memory as described above (for example, 13512345678 is mapped to 112345678, so we locate bit 112345678 in memory and set it to 1)
Traverse the entire bit array and record every number whose bit is 1. These are the mobile phone numbers with duplicates removed.
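
A rough sketch of this procedure is shown below. Several parts are assumptions made only for illustration: the article does not give the prefix list, so kPrefixes is a hypothetical three-entry table chosen so that 135 maps to 1 as in the example, and the names encode_phone and unique_codes are invented here. To keep memory small, the bit array covers only the prefixes in the table instead of the full 0 ~ 4G range.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical prefix table: the position in the list is the compact prefix code.
static const std::vector<std::string> kPrefixes = {"134", "135", "136"};

// Encode an 11-digit number as prefix_code * 100000000 + last eight digits.
// Returns false if the string is malformed or the prefix is not in the table.
bool encode_phone(const std::string& phone, std::uint32_t& code) {
    if (phone.size() != 11) return false;
    for (char c : phone) {
        if (c < '0' || c > '9') return false;
    }
    for (std::size_t i = 0; i < kPrefixes.size(); ++i) {
        if (phone.compare(0, 3, kPrefixes[i]) == 0) {
            code = static_cast<std::uint32_t>(i) * 100000000u +
                   static_cast<std::uint32_t>(std::stoul(phone.substr(3)));
            return true;
        }
    }
    return false;
}

// Returns the distinct encoded numbers in ascending order.
std::vector<std::uint32_t> unique_codes(const std::vector<std::string>& phones) {
    const std::uint64_t total_bits =
        static_cast<std::uint64_t>(kPrefixes.size()) * 100000000ull;
    // One unsigned 32-bit word holds 32 flags, all initially 0.
    std::vector<std::uint32_t> words(static_cast<std::size_t>(total_bits / 32 + 1), 0);

    for (const std::string& p : phones) {
        std::uint32_t code = 0;
        if (encode_phone(p, code)) words[code / 32] |= (1u << (code % 32));
    }

    std::vector<std::uint32_t> result;
    for (std::size_t w = 0; w < words.size(); ++w) {
        if (words[w] == 0) continue;  // skip words with no bits set
        for (unsigned b = 0; b < 32; ++b) {
            if (words[w] & (1u << b)) {
                result.push_back(static_cast<std::uint32_t>(w * 32 + b));
            }
        }
    }
    return result;
}
```

A number that appears several times sets the same bit each time, so it shows up only once in the final traversal. Malformed input is silently skipped in this sketch; real code would probably want to report it.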

Summary
Create a bit array large enough to act as a hash table
Represent each integer by its subscript in the bit array
The value 0 or 1 of a bit indicates whether the corresponding integer is present
Suitable for sorting and searching raw data that contains no duplicates
Each integer originally needed 4 bytes and now needs 1 bit, so space usage shrinks by a factor of 32
With extensions, the method can also handle other kinds of data (including data with duplicates)
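
As one possible extension for data with duplicates, each value can be given two bits instead of one, so that "never seen", "seen once", and "seen more than once" can be told apart. The article does not say which extension it has in mind, so the sketch below is only an assumption, and the class name TwoBitCounter is invented for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Two bits per value: 0 = never seen, 1 = seen once, 2 = seen more than once.
class TwoBitCounter {
public:
    // Covers values 0 .. value_count - 1; four values share one byte.
    explicit TwoBitCounter(std::uint64_t value_count)
        : cells_(static_cast<std::size_t>(value_count / 4 + 1)) {}

    // Record one occurrence of 'value'; the state saturates at 2.
    void add(std::uint64_t value) {
        const unsigned shift = static_cast<unsigned>(value % 4) * 2;
        const unsigned current = (cells_[value / 4] >> shift) & 3u;
        if (current < 2) {
            // Clear the two-bit field, then store current + 1 in it.
            cells_[value / 4] = static_cast<unsigned char>(
                (cells_[value / 4] & ~(3u << shift)) | ((current + 1) << shift));
        }
    }

    // Returns 0, 1, or 2 as described above.
    unsigned state(std::uint64_t value) const {
        const unsigned shift = static_cast<unsigned>(value % 4) * 2;
        return (cells_[value / 4] >> shift) & 3u;
    }

private:
    std::vector<unsigned char> cells_;  // zero-initialized
};
```

This doubles the memory compared with the one-bit version, which still leaves a 16-fold saving over storing 4-byte integers.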

Note:
Because of operating system and programming language restrictions, there may be enough memory in total while a single large contiguous block still cannot be allocated. In that case, allocate several smaller blocks and link them together with a linked list or some other structure.
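
One way to follow this note is to split the bit array into fixed-size chunks that are allocated separately. The sketch below uses a vector of chunks rather than a linked list; the class name ChunkedBitmap and its interface are assumptions made for illustration, and callers are expected to keep every index below total_bits.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A bit array stored as several smaller, separately allocated blocks.
class ChunkedBitmap {
public:
    // bits_per_chunk must be a positive multiple of 8.
    ChunkedBitmap(std::uint64_t total_bits, std::uint64_t bits_per_chunk)
        : bits_per_chunk_(bits_per_chunk) {
        const std::uint64_t count = (total_bits + bits_per_chunk - 1) / bits_per_chunk;
        const std::size_t bytes_per_chunk = static_cast<std::size_t>(bits_per_chunk / 8);
        for (std::uint64_t i = 0; i < count; ++i) {
            // Each chunk is its own zero-initialized allocation.
            chunks_.push_back(std::vector<unsigned char>(bytes_per_chunk));
        }
    }

    // Set the bit at 'index' to 1 (no bounds check in this sketch).
    void set(std::uint64_t index) {
        std::vector<unsigned char>& chunk = chunks_[index / bits_per_chunk_];
        const std::uint64_t offset = index % bits_per_chunk_;
        chunk[offset / 8] |= static_cast<unsigned char>(1u << (offset % 8));
    }

    // Return true if the bit at 'index' is 1.
    bool test(std::uint64_t index) const {
        const std::vector<unsigned char>& chunk = chunks_[index / bits_per_chunk_];
        const std::uint64_t offset = index % bits_per_chunk_;
        return (chunk[offset / 8] & (1u << (offset % 8))) != 0;
    }

    std::size_t chunk_count() const { return chunks_.size(); }

private:
    std::uint64_t bits_per_chunk_;
    std::vector<std::vector<unsigned char>> chunks_;
};
```

The index is split twice: first into a chunk number and an offset inside the chunk, then into a byte and a bit inside that byte, so no single allocation is larger than one chunk.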

References

Programming Pearls
