How to sort a large file containing 500 million integers?

Source: Internet
Author: User

The topic and its background can be seen here: http://weibo.com/p/1001603856172376577500 and http://blog.jobbole.com/87600/

Here is the question, stated clearly: given a large file containing 500 million integers, each in the range 1 to 9,999,999, design a scheme that sorts the elements and writes the sorted result to an output file, using no more than 2 GB of memory.

Note that there are 500 million elements but fewer than 10 million distinct values, so the file is bound to contain duplicates, i.e. some values appear more than once. A naive bitmap, which only records whether a value is present, is therefore not enough for this problem.

Because the data set provided by Uncle Lau in the original post is cumbersome to download, I generated one manually.

There are two options available:

Scenario I: Since no element exceeds 9,999,999, consider counting sort: count the number of occurrences of each value. If 1 appears three times, 2 appears three times, and 5 appears twice in the file, then the final output is 1 1 1 2 2 2 5 5. For a further introduction to counting sort, see Introduction to Algorithms.

#include <iostream>
#include <cstdio>
#include <cstdlib>
using namespace std;

const int MAX = 9999999;
int c[MAX + 1] = { 0 };

void countingsort()
{
    char filename_in[] = "E:\\in.txt";
    FILE *fp_in = fopen(filename_in, "r");
    char content[10];
    // count how many times each value occurs
    while (fgets(content, 10, fp_in) != NULL) {
        int element = atoi(content);
        ++c[element];
    }
    fclose(fp_in);

    // output: write value i exactly c[i] times
    char filename_out[] = "E:\\out.txt";
    FILE *fp_out = fopen(filename_out, "w");
    for (int i = 1; i <= MAX; i++) {
        for (int j = 1; j <= c[i]; j++) {
            fprintf(fp_out, "%d\n", i);
        }
    }
    fclose(fp_out);
}

int main()
{
    countingsort();
    system("pause");
    return 0;
}

Scenario II

We can use a K-way merge strategy, i.e. the external sort mentioned in the original post. First read the large file line by line, sort it in batches, and write out K temporary small files, each of which is sorted. This gives us K sorted small files, and the problem is reduced to merging K sorted small files.
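The code later in this post covers only the merge phase, so here is a minimal sketch of the splitting phase, under assumptions of my own: the input path E:\in.txt (the same one Scenario I reads), temporary files named E:\tmp_0.txt, E:\tmp_1.txt, and so on, and chunks of 50 million integers (roughly 200 MB, well within the 2 GB limit). Each chunk is sorted in memory with std::sort and written out one integer per line; the return value is K, the number of temporary files.

#include <cstdio>
#include <cstdlib>
#include <vector>
#include <algorithm>
using namespace std;

int splitintosortedchunks()
{
    const size_t CHUNK = 50000000;             // 50 million ints per chunk (~200 MB)
    FILE *fp_in = fopen("E:\\in.txt", "r");
    if (fp_in == NULL) return 0;
    char line[16];
    vector<int> chunk;
    chunk.reserve(CHUNK);
    int file_id = 0;
    for (;;) {
        bool eof = (fgets(line, sizeof(line), fp_in) == NULL);
        if (!eof) chunk.push_back(atoi(line));
        if (chunk.size() == CHUNK || (eof && !chunk.empty())) {
            sort(chunk.begin(), chunk.end());               // in-memory sort of one chunk
            char name[64];
            sprintf(name, "E:\\tmp_%d.txt", file_id++);     // temporary file name (assumed)
            FILE *fp_out = fopen(name, "w");
            for (size_t i = 0; i < chunk.size(); i++)
                fprintf(fp_out, "%d\n", chunk[i]);
            fclose(fp_out);
            chunk.clear();
        }
        if (eof) break;
    }
    fclose(fp_in);
    return file_id;                                         // K, the number of sorted temp files
}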

Description

1. In the comment section of the original post, user Iduanyingjie suggested, for the second part of the external sort, that it is not necessary to take a single minimum value each time: since the smallest value of file 1, file 2 and file 3 is in each case certainly among the three smallest candidates, just sort those three values, write them straight to the large file, and in the second round take the 4th, 5th and 6th values, and so on. This is not correct. Consider the following three sorted small files:

20 50 60

40 45 70

70 85 90

First, 20, 40 and 70 are read and the smallest of them, 20, is output. The correct next step is to read 50 from the first file (the file the smallest element 20 came from), then select the smallest of 50, 40 and 70, which is 40, and output it. Next, read 45 from the second file (where 40 came from), select the smallest of 50, 45 and 70, which is 45, and output it. And so on. It is therefore wrong to simply sort and output a fixed batch at a time, as Iduanyingjie suggested.

Therefore, the right approach is to read one element from each of the K files, output the smallest one, read the next element from the file that the smallest element came from, and keep selecting and outputting the minimum in this way until all elements of the K sorted files have been processed.

2. Since selecting the minimum is the critical operation, it can be done efficiently with a min-heap of K elements. The whole process then becomes: output the minimum, replace the heap top, restore the heap; output the minimum, replace the heap top, restore the heap; and so on.

3. When all elements of one small file have been consumed, the strategy is to move the last element of the heap to the top and decrease the heap size by 1. When the heap size reaches 0, every element of every file has been processed. To know which file the current smallest element came from, each heap node also records its file number.

#include <iostream>
#include <cstdio>
#include <cstdlib>
using namespace std;

const int K = 10;   // number of sorted temporary files to merge (adjust to match the split phase)

struct Node
{
    int value;
    int file_id;
};

class Heap
{
public:
    Heap(int capacity)
    {
        this->capacity = capacity;
        this->size = 0;
        p = new Node[this->capacity + 1];
    }
    void buildminheap()
    {
        for (int i = size / 2; i >= 1; i--) {
            minheapify(i);
        }
    }
    void insertnode(Node t)
    {
        size++;
        p[size].value = t.value;
        p[size].file_id = t.file_id;
    }
    Node getmin()
    {
        return p[1];
    }
    int getheapsize()
    {
        return size;
    }
    void minheapify(int i)
    {
        int left = 2 * i;
        int right = 2 * i + 1;
        int min = i;
        if (left <= size && p[left].value < p[i].value) {
            min = left;
        }
        if (right <= size && p[right].value < p[min].value) {
            min = right;
        }
        if (min != i) {
            Node tmp = p[i];
            p[i] = p[min];
            p[min] = tmp;
            minheapify(min);
        }
    }
    void replacerootnodebylastnode()
    {
        p[1] = p[size];
        size--;
    }
    void replacerootnodebynextelement(Node n)
    {
        p[1] = n;
    }
    ~Heap()
    {
        delete[] p;
    }
private:
    Node *p;
    int size;
    int capacity;
};

// Read the next integer from fp; value == -1 signals that this file is exhausted.
Node getoneelement(FILE* &fp, int file_id)
{
    char content[10];
    Node ret;
    ret.file_id = file_id;
    if (fgets(content, 10, fp) != NULL) {
        ret.value = atoi(content);
    } else {
        ret.value = -1;
    }
    return ret;
}

void writeresult(FILE* &fp, int element)
{
    fprintf(fp, "%d\n", element);
}

void kwaymergeviaheap(FILE **fp)
{
    Heap *pheap = new Heap(K);
    // prime the heap with the first element of every file
    for (int i = 0; i < K; i++) {
        Node t = getoneelement(fp[i], i);
        pheap->insertnode(t);
    }
    pheap->buildminheap();

    char filename_output[] = "E:\\outt.txt";
    FILE *fp_out = fopen(filename_output, "w");
    while (pheap->getheapsize() > 0) {
        Node minnode = pheap->getmin();
        writeresult(fp_out, minnode.value);
        Node t = getoneelement(fp[minnode.file_id], minnode.file_id);
        if (t.value == -1) {   // there are no more elements in fp[minnode.file_id]
            pheap->replacerootnodebylastnode();
        } else {
            pheap->replacerootnodebynextelement(t);
        }
        pheap->minheapify(1);
    }

    for (int i = 0; i < K; i++) {
        fclose(fp[i]);
    }
    fclose(fp_out);
    delete pheap;
}

1. The heap implementation is something I wrote back in university with reference to Introduction to Algorithms, and further optimization is certainly possible. For example, the recursive call at the end of minheapify is a tail call that can be replaced by a loop (a sketch follows below), although the compiler may well do this automatically once optimizations are turned on; for the same reason there is no real need to rewrite expressions like 2*i as shift operations. Another example is the interface of getoneelement: returning a Node by value incurs copy overhead, and returning just an int would be better, but it is a bit more troublesome.
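A sketch (my own, not from the original post) of the sift-down with the tail recursion replaced by a loop, written as a free function over the same layout the Heap class uses: nodes stored in p[1..size], with the children of i at 2*i and 2*i+1. HeapNode simply mirrors the fields of Node above so the snippet stands on its own.

#include <algorithm>   // for std::swap

struct HeapNode { int value; int file_id; };   // same fields as Node in the listing above

void minheapify_iterative(HeapNode *p, int size, int i)
{
    for (;;) {
        int left = 2 * i;
        int right = 2 * i + 1;
        int smallest = i;
        if (left <= size && p[left].value < p[smallest].value)
            smallest = left;
        if (right <= size && p[right].value < p[smallest].value)
            smallest = right;
        if (smallest == i)
            break;                      // heap property restored, done
        std::swap(p[i], p[smallest]);   // sink the node one level down
        i = smallest;                   // continue from the child we swapped into
    }
}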

2. The heap is a very useful data structure. Besides K-way merging, it can also be used for K-way intersection (finding the elements common to K sorted linked lists/files); try it yourself. A rough sketch follows below.
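Since the note above suggests trying it, here is a minimal sketch of my own (the function name kwayintersect and the sample lists are invented for illustration). For simplicity it scans the K cursors linearly to find the minimum and the maximum instead of using a heap; with a large K, the scan for the minimum would be replaced by the same min-heap used for merging.

#include <cstdio>
#include <vector>
#include <algorithm>
using namespace std;

vector<int> kwayintersect(const vector<vector<int> > &lists)
{
    vector<int> result;
    size_t k = lists.size();
    if (k == 0) return result;
    vector<size_t> pos(k, 0);                  // one cursor per sorted list
    for (;;) {
        int minval = 0, maxval = 0;
        for (size_t i = 0; i < k; i++) {
            if (pos[i] >= lists[i].size()) return result;   // any list exhausted -> done
            int v = lists[i][pos[i]];
            if (i == 0) { minval = maxval = v; }
            else { minval = min(minval, v); maxval = max(maxval, v); }
        }
        if (minval == maxval) {                // all cursors agree -> common element
            result.push_back(minval);
            for (size_t i = 0; i < k; i++) pos[i]++;
        } else {
            for (size_t i = 0; i < k; i++)     // advance only the cursors holding the minimum
                if (lists[i][pos[i]] == minval) pos[i]++;
        }
    }
}

int main()
{
    int a[] = {1, 3, 5, 7}, b[] = {3, 5, 8}, c[] = {2, 3, 5, 9};
    vector<vector<int> > lists;
    lists.push_back(vector<int>(a, a + 4));
    lists.push_back(vector<int>(b, b + 3));
    lists.push_back(vector<int>(c, c + 4));
    vector<int> common = kwayintersect(lists);
    for (size_t i = 0; i < common.size(); i++)
        printf("%d\n", common[i]);             // prints 3 and 5
    return 0;
}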

3. For the output, you could actually accumulate a batch of elements in a buffer and flush them to disk in one go. It is a bit of extra trouble, so I did not write it; a possible sketch is shown below.
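One way to get the same effect with little code is to hand stdio a larger buffer via setvbuf, so writes accumulate in memory and hit the disk in big blocks. This is a minimal sketch, not from the original post; the 1 MB buffer size and the output path are assumptions.

#include <cstdio>

int main()
{
    static char buffer[1 << 20];                       // 1 MB user-supplied buffer
    FILE *fp_out = fopen("E:\\out.txt", "w");
    if (fp_out == NULL) return 1;
    setvbuf(fp_out, buffer, _IOFBF, sizeof(buffer));   // full buffering, before the first write
    for (int i = 0; i < 1000000; i++) {
        fprintf(fp_out, "%d\n", i);                    // writes accumulate in the buffer
    }
    fclose(fp_out);                                    // flushes whatever remains
    return 0;
}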

4. Why does the I/O part use C while everything else is C++? Because C++ iostream I/O is too slow, frighteningly slow.
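For what it is worth, if one did want to stay with iostream, the usual mitigation (a well-known technique, not something the original post uses) is to detach iostream from C stdio and untie cin from cout, which removes most of the overhead:

#include <iostream>
using namespace std;

int main()
{
    ios_base::sync_with_stdio(false);   // stop synchronizing iostream with C stdio
    cin.tie(NULL);                      // do not flush cout before every cin read
    int x;
    long long sum = 0;
    while (cin >> x) sum += x;          // reading large inputs is now much faster
    cout << sum << "\n";
    return 0;
}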
