Crazy Bitmap: using a bitmap to generate a 12 GB set of large integers, without duplicates and in random order

Source: Internet
Author: User

The previous article described how to use a bitmap to sort data that contains no duplicates. With the sorting algorithm ready, I wanted to test it on big data, because a small dataset is sorted in memory almost instantly.

I. Dataset generation requirements

1. Each value is an integer in the range 0 to 2147483647 (2^31-1);

2. The dataset contains 60% of the integers in the range 0 to 2^31-1, that is, 40% of them are dropped;

3. The dataset contains no duplicates, that is, no two numbers are equal;

4. The generated data should be as disordered as possible.

II. Solution analysis

At first I just wanted to produce some data quickly, thinking that the test data only had to meet the requirements above. While writing the generator, I found that the first three requirements are easy to meet, but "as disordered as possible" is not. First, estimate the size of such a dataset. Counting each integer as 10 characters, 60% * 2^31 * 10 B = 12 GB, so the file takes 12 GB on disk. If the data is held in memory as 4-byte integers, it takes 60% * 2^31 * 4 B = 4.8 GB.
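The size estimate above can be checked with a few lines of C. This is only a sketch of the arithmetic: the function names are hypothetical, 1 GB is taken as 2^30 bytes, and the 10 bytes per integer is the article's own rough assumption.

```c
#include <stdint.h>

/* Number of integers kept: 60% of the 2^31 values in [0, 2^31 - 1]. */
uint64_t kept_count(void) {
    return ((uint64_t)1 << 31) * 6 / 10;
}

/* Total size in bytes when every integer occupies bytes_per_int bytes. */
uint64_t dataset_bytes(uint64_t bytes_per_int) {
    return kept_count() * bytes_per_int;
}

/* Convert a byte count to GB, with 1 GB taken as 2^30 bytes (assumption). */
double to_gb(uint64_t bytes) {
    return (double)bytes / (double)((uint64_t)1 << 30);
}
```

With these helpers, to_gb(dataset_bytes(10)) comes out at roughly 12 and to_gb(dataset_bytes(4)) at roughly 4.8, matching the figures in the text.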

Problem 4 in chapter 1 of "Programming Pearls" is similar to the requirements here. The solution given in the book is as follows:

    // Generate k distinct random integers in [0, n)
    for i = [0, n)
        x[i] = i
    for i = [0, k)
        swap(x[i], x[randint(i, n-1)])  // randint(a, b) returns a random integer in [a, b]; swap exchanges two elements
        save(x[i])
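For a small n, the book's pseudocode can be sketched in C as a partial Fisher-Yates shuffle. The name sample_distinct and the output array out are assumptions made for this illustration, and randint here relies on rand(), so it is only suitable when n does not exceed RAND_MAX.

```c
#include <stdlib.h>

/* Return a random integer in [a, b], inclusive. Uses rand(), so it has a
   slight modulo bias and only works when b - a <= RAND_MAX. */
static int randint(int a, int b) {
    return a + rand() % (b - a + 1);
}

/* Fill out[0..k-1] with k distinct random integers from [0, n).
   Returns 0 on success, -1 on invalid arguments or allocation failure. */
int sample_distinct(int n, int k, int *out) {
    if (n <= 0 || k < 0 || k > n) return -1;
    int *x = malloc((size_t)n * sizeof *x);
    if (x == NULL) return -1;
    for (int i = 0; i < n; i++)
        x[i] = i;
    for (int i = 0; i < k; i++) {
        int j = randint(i, n - 1);   /* pick from the not yet fixed part */
        int tmp = x[i];
        x[i] = x[j];
        x[j] = tmp;
        out[i] = x[i];               /* corresponds to save(x[i]) */
    }
    free(x);
    return 0;
}
```

The whole array x has to fit in memory, which is exactly the limitation discussed next.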

The solution above creates an array of n elements and assumes that n is small enough for the array to fit in memory. According to the earlier analysis, an array covering the full range up to 2^31-1, with 4 bytes per element, needs 8 GB. It does not fit in memory, so operations such as swap(x[i], x[random]) cannot be performed. Perhaps we can first generate data that meets requirements 1-3:

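    // Keep each integer in [0, 2^31-1] with a probability of about 60%.
    // long long keeps the counter from overflowing at the upper bound.
    for (long long num = 0; num <= 2147483647LL; num++) {
        if (rand() <= 0.6 * RAND_MAX)   // sample roughly 60% using a random number
            saveData(num);
    }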

The next step would be to shuffle this data, which is the opposite problem to sorting. One possible approach is to shuffle the data in segments and then merge the segments.

However, since a bitmap can be used for sorting, I wondered why it could not be used to generate the data as well. Inspired by hash tables, I designed a bitmap-based generation method. The procedure is as follows:

1. Allocate a bitmap B of 2147483647 bits in memory. This requires 2^31 / 8 B = 256 MB;

2. Set all bits of the bitmap to 0 (B[i] = 0), indicating that none of the numbers 0-2147483647 has been used;

3. Generate a random number random in the range 0-2147483647 and check in the bitmap whether B[random] is 0. If it is 0, this number has not been used: write random to the file and set B[random] to 1. If it is 1, this number has already been used: check whether B[random + 1] is 0, and if so, save random + 1 and set B[random + 1] to 1. If it is not 0, probe random - 1, random + 2, random - 2, ... until a bit that is 0 is found. This is similar to collision handling in a hash table. Here I use linear probing that swings left and right.

The pseudocode is as follows:

    void generatorData() {
        B = new bitset(LONG_MAX);
        B.reset();                        // set all bits of the bitmap to 0
        count = 0;                        // number of integers saved so far
        while (count < 0.6 * LONG_MAX) {
            random = getLongRand();
            offset = 0;
            while (B[random + offset] == 1) {
                offset = getNextIndex();  // get the next probing offset
            }
            saveData(random + offset);
            count++;
            B[random + offset] = 1;       // mark this number as used
        }
    }

In this algorithm, each iteration generates a random number. If the number has not been used, it is saved; otherwise the nearest unused number is found and saved instead. There are two key points. The first is getLongRand(): the quality of the random numbers it produces in the range 0 to LONG_MAX directly affects the randomness of the whole dataset. If getLongRand() is random, the generated data will be random too. The second is getNextIndex(): when a random number has already been used, its neighbors have to be probed, and the design of the probing sequence affects the efficiency of the algorithm. If probes keep failing, a lot of time is spent probing, especially in the later stages, when many numbers have already been used and the number of probes per value becomes much higher. Generating 100% of the numbers instead of 60% with this algorithm would be very time-consuming: the last few numbers would almost always require traversing nearly the whole number space. Because we only generate 60% of the data, the 0 bits in the bitmap never become very sparse, so lookups stay comparatively cheap.
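The effect of the fill ratio on probing can be observed on a small range. The sketch below is not the article's implementation: the names swing_offset and generate are hypothetical, the offset is computed from a step counter rather than kept in static variables, one byte per value is used instead of one bit to keep the code short, and rand() % n limits it to n no larger than RAND_MAX.

```c
#include <stdlib.h>

/* k-th offset of the swing sequence: k = 1, 2, 3, 4 gives +1, -1, +2, -2. */
long swing_offset(long k) {
    return (k % 2 == 1) ? (k + 1) / 2 : -(k / 2);
}

/* Write `want` distinct values from [0, n) into out, in generation order.
   Returns the total number of probes, or -1 on error. */
long generate(long n, long want, long *out) {
    if (n <= 0 || want < 0 || want > n) return -1;
    unsigned char *used = calloc((size_t)n, 1);  /* one byte per value */
    if (used == NULL) return -1;
    long probes = 0;
    for (long count = 0; count < want; count++) {
        long r = rand() % n;   /* assumes n <= RAND_MAX */
        long loc = r;
        /* Try r, r+1, r-1, r+2, r-2, ...; candidates outside [0, n)
           are skipped but still counted as probes. */
        for (long k = 1; loc < 0 || loc >= n || used[loc]; k++) {
            loc = r + swing_offset(k);
            probes++;
        }
        used[loc] = 1;
        out[count] = loc;
    }
    free(used);
    return probes;
}
```

Calling generate(n, want, out) with want at 60% of n and then at 100% of n, and comparing the returned probe counts, gives a feel for how much more expensive the last values are.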

The implementation code is as follows:

    #include <stdio.h>
    #include <stdbool.h>

    // find(), setOne() and randLong() are the author's helpers, defined elsewhere.

    /********* Generate a sequence that swings right and left: 1, -1, 2, -2, ... *********/
    long getNextIndex(long size, long index) {
        static short tag = -1;
        static long left = 0;
        static long right = 0;
        if (index == -1) {      // a different index follows: reset the static variables
            tag = -1;
            left = 0;
            right = 0;
            return 0;           // return here so that the reset call does not advance the sequence
        }
        if (index + (left - 1) < 0 && index + (right + 1) >= size)
            return 0;           // the whole range has been traversed, nothing left to probe
        if (index + (left - 1) < 0)
            return ++right;     // the left boundary has been reached, probe to the right
        if (index + (right + 1) >= size)
            return --left;      // the right boundary has been reached, probe to the left
        if (tag == -1) {        // neither boundary reached: alternate between right and left
            tag *= -1;
            return ++right;
        } else {
            tag *= -1;
            return --left;
        }
    }

    void makePhoneNum(unsigned char *bitmap, long maxNum, short bitSize) {
        FILE *phoneNumFile = fopen("phoneNumber.txt", "w");
        long count = 0;
        long percent = 0.6 * maxNum;
        while (true) {
            long index = randLong(bitSize);
            long offset = 0;
            while (find(bitmap, index + offset) == 1) {  // this number has been used or does not exist
                offset = getNextIndex(maxNum, index);
                if (offset == 0) {  // an offset of 0 means the whole range has been probed
                    fclose(phoneNumFile);
                    return;
                }
            }
            getNextIndex(maxNum, -1);  // reset the static variables
            long loc = index + offset;
            setOne(bitmap, loc);
            fprintf(phoneNumFile, "%ld\n", loc);
            if (++count > percent)     // stop once 60% has been saved
                break;
            if (count % 1000000 == 0)
                printf("count:\t%ld\n", count);
        }
        fclose(phoneNumFile);
    }
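The helpers find and setOne used in makePhoneNum are not shown in this article. As a rough idea of what such bitmap operations can look like, here is a minimal sketch under stated assumptions: bit i is stored in byte i / 8, an out-of-range index is reported as "used" (which matches the comment "used or does not exist"), and the names bitmapCreate, bitmapFind and bitmapSetOne as well as the extra nbits parameter are hypothetical, not the author's signatures.

```c
#include <stdlib.h>

/* Allocate a zero-filled bitmap that can hold nbits bits; NULL on failure.
   nbits / 8 + 1 bytes is always enough and avoids overflow in nbits + 7. */
unsigned char *bitmapCreate(long nbits) {
    if (nbits <= 0) return NULL;
    return calloc((size_t)(nbits / 8 + 1), 1);
}

/* Return 1 if bit i is set or i lies outside [0, nbits), otherwise 0. */
int bitmapFind(const unsigned char *bitmap, long nbits, long i) {
    if (i < 0 || i >= nbits) return 1;
    return (bitmap[i >> 3] >> (i & 7)) & 1;
}

/* Set bit i to 1; out-of-range indexes are ignored. */
void bitmapSetOne(unsigned char *bitmap, long nbits, long i) {
    if (i < 0 || i >= nbits) return;
    bitmap[i >> 3] |= (unsigned char)(1u << (i & 7));
}
```

For the full range of the article, bitmapCreate(2147483647) would allocate about 256 MB, as computed in the procedure above.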

The random number generator randLong() is covered separately in the next article, which summarizes random number generation. The code can also be viewed on GitHub.
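Until then, one possible way to obtain a random value of up to 31 bits from the standard rand() can be sketched as follows. This is not the author's randLong: the name randLongSketch is hypothetical, and the sketch relies only on the guarantee that RAND_MAX is at least 32767, so it takes 15 bits from each rand() call. On some C libraries the low bits of rand() are of poor quality, so a better generator may be preferable in practice.

```c
#include <stdlib.h>

/* Return a value built from bitSize random bits, i.e. in [0, 2^bitSize - 1].
   bitSize must be in [1, 31]; returns -1 otherwise. */
long randLongSketch(short bitSize) {
    unsigned long value = 0;
    int bits = 0;
    if (bitSize < 1 || bitSize > 31) return -1;
    while (bits < bitSize) {
        /* append the low 15 bits of one rand() call */
        value = (value << 15) | ((unsigned long)rand() & 0x7FFFu);
        bits += 15;
    }
    /* keep only the lowest bitSize bits */
    return (long)(value & ((1UL << bitSize) - 1UL));
}
```

With bitSize = 31 the result covers the range 0 to 2147483647 used in this article.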

Once the data is generated, you have an unordered file that can be sorted in ascending order to verify the sorting algorithm. It turned out that generating the 12 GB of data takes nearly two days, and generation becomes slow as the number of probes per value increases. This run was a nightmare, but the result is not what matters: I became familiar with basic bitmap operations and gained a new understanding of random numbers. I think this combination of a bitmap with collision handling is still quite instructive.
