Massive Data Processing, Part One: An Example

Source: Internet
Author: User

Topic:

Given an input file containing 4 billion nonnegative integers, design an algorithm that produces an integer that is not in the file. Assume that you have 1GB of memory to complete this task.

I. A few numbers

1. 4 billion = 4 * 10^9 ≈ 2^2 * 2^30 = 2^32, so the file holds on the order of 2^32 integers.

2. 1 GB = 2^30 bytes = 8 * 2^30 bits ≈ 8 billion bits. So if one bit represents one integer, 1 GB can represent about 8 billion integers (we do not need that many).
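
The two estimates above can be checked with a few lines of C++. This is only a sketch; the constant and function names are made up for illustration and do not come from the original text.

```cpp
#include <cstdint>

// Illustrative constants (hypothetical names).
constexpr std::uint64_t kFourBillion = 4000000000ULL;        // 4 * 10^9
constexpr std::uint64_t kTwoPow32    = 1ULL << 32;           // 4294967296
constexpr std::uint64_t kBitsInOneGB = 8ULL * (1ULL << 30);  // 8589934592

// True when one bit per value fits into the given number of bits.
constexpr bool fits_one_bit_per_value(std::uint64_t values, std::uint64_t bits) {
    return values <= bits;
}

// 4 billion is slightly below 2^32, and 2^32 bits fit easily in 1 GB.
static_assert(kFourBillion < kTwoPow32, "4 billion is below 2^32");
static_assert(fits_one_bit_per_value(kTwoPow32, kBitsInOneGB), "2^32 bits fit in 1 GB");
```

In fact 2^32 bits are only 512 MB, half of the available memory.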

II. C++ basics needed

1. The byte type: byte is not a keyword or built-in data type in C++. If you want a byte type, use unsigned char (8 bits).

2. unsigned char versus char: in a signed char the first (highest) bit is the sign bit, so its range is -128 to 127, while unsigned char ranges from 0 to 255. (Whether plain char is signed is implementation-defined.)

3. Maximum values of the types: https://msdn.microsoft.com/en-us/library/296az74e(vs.80).aspx

    #include <climits>   // INT_MAX, CHAR_MAX, UCHAR_MAX
    #include <iostream>
    using namespace std;

    int main()
    {
        cout << INT_MAX << endl;
        cout << CHAR_MAX << endl;
        cout << UCHAR_MAX << endl;
        return 0;
    }

4. Reading a file

    #include <fstream>
    #include <iostream>
    using namespace std;

    int main()
    {
        ifstream infile;
        infile.open("Data.txt");
        if (!infile) {
            cerr << "error: unable to open input file: Data.txt" << endl;
            return 1;
        }
        char s[16];   // room for a 10-digit number plus the terminating '\0'
        // Test the read itself instead of eof(), so the last line is not processed twice
        while (infile.getline(s, sizeof(s), '\n')) {
            cout << s << endl;
        }
        return 0;
    }

5. Bit operations

<<: left shift, e.g. 1<<2 --> 00000001<<2 --> 00000100 (>> is the corresponding right shift)

&: bitwise AND; a result bit is 1 only where both operands have a 1, otherwise it is 0: 01010101 & 00001111 == 00000101 (| and ^ are the corresponding OR and XOR)

+: take care to distinguish '+' from '&'; '+' is addition, carrying 1 whenever a column sums to 2
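
As a small illustration of these operators, here is a sketch of two helpers that set and test a single bit in a byte. The names set_bit and test_bit are hypothetical, chosen only for this example.

```cpp
#include <cassert>
#include <cstdint>

// Return `byte` with bit `pos` set to 1 (pos 0 is the least significant bit).
inline std::uint8_t set_bit(std::uint8_t byte, int pos) {
    assert(pos >= 0 && pos < 8);
    return static_cast<std::uint8_t>(byte | (1u << pos));
}

// Return true if bit `pos` of `byte` is 1.
inline bool test_bit(std::uint8_t byte, int pos) {
    assert(pos >= 0 && pos < 8);
    return ((byte >> pos) & 1u) != 0;
}
```

For the operands used above, 0x55 & 0x0F is 0x05, 0x55 | 0x0F is 0x5F, 0x55 ^ 0x0F is 0x5A, while 0x55 + 0x0F is 0x64 because addition carries.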

III. Bit vectors

A vector (for example Vector in Java) is an array whose size can change.

A bit vector is first of all a vector (array); each element occupies 1 bit of memory and stores either 0 or 1.
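
A minimal sketch of such a structure in C++, assuming a fixed size and unsigned char as the underlying storage. The class name BitVector and its members are hypothetical; the standard library's std::vector<bool> and std::bitset offer similar packed storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-size bit vector; every bit starts at 0.
class BitVector {
public:
    explicit BitVector(std::size_t nbits)
        : nbits_(nbits), bytes_((nbits + 7) / 8, 0) {}

    // Set bit i to 1.
    void set(std::size_t i) {
        assert(i < nbits_);
        bytes_[i / 8] |= static_cast<unsigned char>(1u << (i % 8));
    }

    // Return true if bit i is 1.
    bool test(std::size_t i) const {
        assert(i < nbits_);
        return ((bytes_[i / 8] >> (i % 8)) & 1u) != 0;
    }

    std::size_t size() const       { return nbits_; }
    std::size_t byte_count() const { return bytes_.size(); }

private:
    std::size_t nbits_;
    std::vector<unsigned char> bytes_;  // 8 bits packed per element
};
```

A BitVector of 20 bits occupies 3 bytes, which is the whole point: one bit per element instead of one byte or more.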

IV. Solving the problem

1. The idea: by the initial analysis, if one bit represents one integer, everything fits comfortably in 1 GB of memory.

(1) Create a bit vector BV of 4 billion bits (strictly speaking, an array of bytes is not called a bit vector)

(2) Initialize all elements of BV to 0

(3) Scan every number in the file and set the bit corresponding to that number to 1

(4) Traverse BV from the beginning and return the first index whose value is 0
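
The four steps can be sketched on a small range as follows. This is an illustration only: the function name first_missing is hypothetical, an in-memory vector stands in for the input file, and std::vector<bool> is used as a ready-made packed bit vector.

```cpp
#include <cstddef>
#include <vector>

// Return the smallest value in [0, limit) that does not occur in `numbers`,
// or `limit` if every value in that range is present.
std::size_t first_missing(const std::vector<std::size_t>& numbers, std::size_t limit) {
    std::vector<bool> seen(limit, false);       // steps (1) and (2): all bits 0
    for (std::size_t n : numbers) {             // step (3): mark each number
        if (n < limit) seen[n] = true;
    }
    for (std::size_t i = 0; i < limit; ++i) {   // step (4): first bit still 0
        if (!seen[i]) return i;
    }
    return limit;
}
```

For example, with the numbers 0, 1, 2, 4, 5 and a limit of 8 the function returns 3. Duplicates in the input are harmless, since setting a bit twice changes nothing.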

2. The storage type is byte. A byte occupies 8 bits, so it can represent 8 integers. Given a number, how do we compute its position?

byte[] BV = new byte[num]

0: the 8th bit (counting from the left) of the first element

7: the 1st bit of the first element

10: 10 > 7, so it is not in the first element; it is the third bit from the right of the second element

...

For any integer n, it is located in byte[n/8], at the bit selected by 1<<(n%8). Roughly like this:

7 6 5 4 3 2 1 0,  15 14 13 12 11 10 9 8,  ...

Conversely, given a position as above, the number it represents, written byte[i][1<<j], is i*8+j:

    byte[0][1<<4] = 0*8+4 = 4

    byte[1][1<<5] = 1*8+5 = 13
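
The mapping in both directions can be written as a few one-line helpers. The function names here are hypothetical and only serve to make the arithmetic above checkable.

```cpp
#include <cstddef>

// Which byte holds the bit for integer n.
constexpr std::size_t byte_index(std::size_t n) { return n / 8; }

// Position of the bit inside that byte (0 = least significant).
constexpr unsigned bit_offset(std::size_t n) { return static_cast<unsigned>(n % 8); }

// Mask that selects that bit, i.e. 1 << (n % 8).
constexpr unsigned bit_mask(std::size_t n) { return 1u << (n % 8); }

// Inverse mapping: the integer represented by bit j of byte i.
constexpr std::size_t value_at(std::size_t i, unsigned j) { return i * 8 + j; }
```

For n = 10 this gives byte 1, bit offset 2 and mask 00000100, matching the example above; value_at(1, 5) gives 13.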

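3. Code

    #include <climits>
    #include <cstdlib>
    #include <fstream>
    #include <iostream>
    using namespace std;

    int main()
    {
        ifstream infile;
        infile.open("Data.txt");
        if (!infile) {
            cerr << "error: unable to open input file: Data.txt" << endl;
            return 1;
        }
        char s[16];   // room for a 10-digit number plus the terminating '\0'
        int temp = 0;
        // Number of nonnegative int values: INT_MAX + 1 = 2^31.
        // Computed as unsigned, because INT_MAX + 1 overflows a signed int.
        unsigned int nints = (unsigned int)INT_MAX + 1;
        cout << nints << endl;
        // One bit per value: nints / 8 bytes (256 MB). Note the square brackets:
        // new unsigned char(n) would allocate a single char. The trailing ()
        // zero-initializes the array.
        unsigned char *bv = new unsigned char[nints / 8]();
        while (infile.getline(s, sizeof(s), '\n')) {
            temp = atoi(s);
            if (temp < 0) continue;            // the input is assumed nonnegative
            bv[temp / 8] |= 1 << (temp % 8);   // mark temp as seen
        }
        // sizeof(bv) is only the size of the pointer, so loop over the byte count
        for (unsigned int i = 0; i < nints / 8; i++) {
            for (int j = 0; j < 8; j++) {
                if (((bv[i] >> j) & 1) == 0) {
                    cout << "Result: " << i * 8 + j << endl;
                    delete[] bv;
                    return 0;
                }
            }
        }
        delete[] bv;
        return 0;
    }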

Writing this piece of code nearly killed me ...

When I ran it in the old VC6.0 it reported errors, although a result was still computed; I assumed that was a VC problem, hmm ...

I did not test it with a huge amount of data, though.

The idea and the solution come from a question in the interview book Cracking the Coding Interview, but I changed it to a C++ version.

The code given there also seems to have a small bug in the check on its line 18. I have not verified this experimentally, it is just a feeling ... Anyway, my version changes it.

V. Advanced: using only 10 MB of memory

Faced with this kind of problem I cannot promise I would solve it, but my thinking used to be:

Given the amount of data and the memory, first estimate roughly whether the data fits in memory:

(1) It fits: the simplest solution is to sort

(2) It does not fit: divide and conquer, splitting into sub-files and solving each separately

But now it seems one can think of it this way:

(2) The data does not fit, but a bit vector representing it does: solve it with the bit vector

(3) A bit vector cannot be used, or it does not fit either: then you have to look for some other regularity in the data.

For this problem my first reaction was to split into sub-files and solve each, but then I saw another, similar method ...

The idea:

Obviously cases (1) and (2) do not apply here, so we can only divide the range.

10 MB = 10 * 2^20 bytes ≈ 2^23 bytes, which holds about 2^21 four-byte integers

So divide the range into chunks of 2^21 integers each.

Number of chunks: 2^32 / 2^21 = 2^11, about 2000 chunks

What is the point of that?

Adjust slightly: chunk size 2^20, number of chunks 2^12
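
As a rough check that the adjusted sizes fit in 10 MB, the memory use can be estimated as below. The constant names are hypothetical, and the 4-byte counter per chunk is an assumption carried over from the four-byte integers used in the estimate above.

```cpp
#include <cstdint>

constexpr std::uint64_t kMemoryBytes = 10ULL * (1ULL << 20);  // 10 MB
constexpr std::uint64_t kChunkSize   = 1ULL << 20;            // values per chunk
constexpr std::uint64_t kNumChunks   = 1ULL << 12;            // number of chunks

// One 4-byte counter per chunk (assumed counter width).
constexpr std::uint64_t kCounterBytes = kNumChunks * 4;       // 16 KB
// One bit per value for a single chunk.
constexpr std::uint64_t kBitVecBytes  = kChunkSize / 8;       // 128 KB

// The chunks together cover the whole 2^32 range, and both structures fit.
static_assert(kChunkSize * kNumChunks == (1ULL << 32), "chunks cover 2^32 values");
static_assert(kCounterBytes + kBitVecBytes <= kMemoryBytes, "fits in 10 MB");
```

Both structures together need well under 1 MB, so this split leaves plenty of headroom.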

(1) First scan the entire file

If a number falls in the interval [0, 2^20 - 1], increment the counter of chunk 1; if it falls in [2^20, 2^21 - 1], increment the counter of chunk 2; and so on

(2) Look at the counter value of each chunk

If the value is less than 2^20, that chunk must be missing at least one number

(3) Within such a chunk, find a missing number with the bit vector method described earlier
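
The three steps can be sketched as a two-pass function. This is a scaled-down illustration under stated assumptions: the function name find_missing_chunked is hypothetical, an in-memory vector stands in for the file (which would really be read twice), and the chunk size and chunk count are parameters so the logic can be tried on tiny inputs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Return a value in [0, chunk_size * num_chunks) that is absent from `data`,
// or -1 if no chunk has fewer than `chunk_size` entries.
std::int64_t find_missing_chunked(const std::vector<std::uint32_t>& data,
                                  std::uint32_t chunk_size,
                                  std::uint32_t num_chunks) {
    assert(chunk_size > 0);

    // Pass 1: count how many numbers fall into each chunk.
    std::vector<std::uint32_t> counts(num_chunks, 0);
    for (std::uint32_t v : data) {
        std::uint32_t c = v / chunk_size;
        if (c < num_chunks) ++counts[c];
    }

    // A chunk with fewer than chunk_size entries must miss at least one value.
    std::uint32_t sparse = num_chunks;
    for (std::uint32_t c = 0; c < num_chunks; ++c) {
        if (counts[c] < chunk_size) { sparse = c; break; }
    }
    if (sparse == num_chunks) return -1;  // the counts alone cannot decide

    // Pass 2: bit vector restricted to the sparse chunk.
    const std::uint64_t base = static_cast<std::uint64_t>(sparse) * chunk_size;
    std::vector<bool> seen(chunk_size, false);
    for (std::uint32_t v : data) {
        if (v / chunk_size == sparse) seen[v - base] = true;
    }
    for (std::uint32_t i = 0; i < chunk_size; ++i) {
        if (!seen[i]) return static_cast<std::int64_t>(base + i);
    }
    return -1;  // not reached: a sparse chunk always has an unset bit
}
```

With chunk size 4 and 3 chunks, the input 0 1 2 3 4 5 7 8 9 10 11 yields 6. If 7 appears twice (0 1 2 3 4 5 7 7 8 9 10 11), every counter reaches 4 and the function returns -1 even though 6 is missing, which is exactly the duplicate case the counters cannot detect.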

Open question:

Because numbers can repeat, what if every interval contains quite a lot of numbers? Would we then have to compute a bit vector for each chunk?

  

  

  
