How to sort a large file containing 500 million integers?

Source: Internet
Author: User

The topic and its background can be seen here: http://weibo.com/p/1001603856172376577500 and http://blog.jobbole.com/87600/

Here is the question, stated clearly: given a large file containing 500 million integers, each in the range 1 to 9,999,999, design a scheme that sorts the elements and writes the sorted result to an output file, using no more than 2 GB of memory.

Note that there are 500 million elements but fewer than 10 million distinct values, so the file is bound to contain duplicates, i.e. some values appear more than once. A naive bitmap, which only records whether a value is present, is therefore not enough for this problem.

Because the data set provided by Uncle Lau in the original post is cumbersome to download, I generated one manually.

There are two options available:

Scenario I: Since no element exceeds 9,999,999, consider counting sort: count the number of occurrences of each value. If 1 appears three times, 2 appears three times, and 5 appears twice in the file, then the final output is 1 1 1 2 2 2 5 5. For a further introduction to counting sort, see Introduction to Algorithms.

#include <iostream>
#include <cstdio>
#include <cstdlib>
using namespace std;

const int MAX = 9999999;
int c[MAX + 1] = { 0 };

void countingsort()
{
    char filename_in[] = "E:\\in.txt";
    FILE *fp_in = fopen(filename_in, "r");
    char content[10];
    // count how many times each value occurs
    while (fgets(content, 10, fp_in) != NULL) {
        int element = atoi(content);
        ++c[element];
    }
    fclose(fp_in);

    // output: write value i exactly c[i] times
    char filename_out[] = "E:\\out.txt";
    FILE *fp_out = fopen(filename_out, "w");
    for (int i = 1; i <= MAX; i++) {
        for (int j = 1; j <= c[i]; j++) {
            fprintf(fp_out, "%d\n", i);
        }
    }
    fclose(fp_out);
}

int main()
{
    countingsort();
    system("pause");
    return 0;
}

Scenario II

We can use a K-way merge strategy, i.e. the external sort mentioned in the original post. First read the large file line by line, sort it in batches, and write out K temporary small files, each of which is sorted. This gives us K sorted small files, and the problem is reduced to merging K sorted small files.
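The code later in this post covers only the merge phase, so here is a minimal sketch of the splitting phase, under assumptions of my own: the input path E:\in.txt (the same one Scenario I reads), temporary files named E:\tmp_0.txt, E:\tmp_1.txt, and so on, and chunks of 50 million integers (roughly 200 MB, well within the 2 GB limit). Each chunk is sorted in memory with std::sort and written out one integer per line; the return value is K, the number of temporary files.

#include <cstdio>
#include <cstdlib>
#include <vector>
#include <algorithm>
using namespace std;

int splitintosortedchunks()
{
    const size_t CHUNK = 50000000;             // 50 million ints per chunk (~200 MB)
    FILE *fp_in = fopen("E:\\in.txt", "r");
    if (fp_in == NULL) return 0;
    char line[16];
    vector<int> chunk;
    chunk.reserve(CHUNK);
    int file_id = 0;
    for (;;) {
        bool eof = (fgets(line, sizeof(line), fp_in) == NULL);
        if (!eof) chunk.push_back(atoi(line));
        if (chunk.size() == CHUNK || (eof && !chunk.empty())) {
            sort(chunk.begin(), chunk.end());               // in-memory sort of one chunk
            char name[64];
            sprintf(name, "E:\\tmp_%d.txt", file_id++);     // temporary file name (assumed)
            FILE *fp_out = fopen(name, "w");
            for (size_t i = 0; i < chunk.size(); i++)
                fprintf(fp_out, "%d\n", chunk[i]);
            fclose(fp_out);
            chunk.clear();
        }
        if (eof) break;
    }
    fclose(fp_in);
    return file_id;                                         // K, the number of sorted temp files
}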

Description

1. In the comment section of the original post, user Iduanyingjie suggested, for the second part of the external sort, that it is not necessary to take a single minimum value each time: since the smallest value of file 1, file 2 and file 3 is in each case certainly among the three smallest candidates, just sort those three values, write them straight to the large file, and in the second round take the 4th, 5th and 6th values, and so on. This is not correct. Consider the following three sorted small files:

20 50 60

40 45 70

70 85 90

First, 20, 40 and 70 are read and the smallest of them, 20, is output. The correct next step is to read 50 from the first file (the file the smallest element 20 came from), then select the smallest of 50, 40 and 70, which is 40, and output it. Next, read 45 from the second file (where 40 came from), select the smallest of 50, 45 and 70, which is 45, and output it. And so on. It is therefore wrong to simply sort and output a fixed batch at a time, as Iduanyingjie suggested.

Therefore, the right approach is to read one element from each of the K files, output the smallest one, read the next element from the file that the smallest element came from, and keep selecting and outputting the minimum in this way until all elements of the K sorted files have been processed.

2. Since selecting the minimum is the critical operation, it can be done efficiently with a min-heap of K elements. The whole process then becomes: output the minimum, replace the heap top, restore the heap; output the minimum, replace the heap top, restore the heap; and so on.

3. When all elements of one small file have been consumed, the strategy is to move the last element of the heap to the top and decrease the heap size by 1. When the heap size reaches 0, every element of every file has been processed. To know which file the current smallest element came from, each heap node also records its file number.

#include <iostream>
#include <cstdio>
#include <cstdlib>
using namespace std;

const int K = 10;   // number of sorted temporary files to merge (adjust to match the split phase)

struct Node
{
    int value;
    int file_id;
};

class Heap
{
public:
    Heap(int capacity)
    {
        this->capacity = capacity;
        this->size = 0;
        p = new Node[this->capacity + 1];
    }
    void buildminheap()
    {
        for (int i = size / 2; i >= 1; i--) {
            minheapify(i);
        }
    }
    void insertnode(Node t)
    {
        size++;
        p[size].value = t.value;
        p[size].file_id = t.file_id;
    }
    Node getmin()
    {
        return p[1];
    }
    int getheapsize()
    {
        return size;
    }
    void minheapify(int i)
    {
        int left = 2 * i;
        int right = 2 * i + 1;
        int min = i;
        if (left <= size && p[left].value < p[i].value) {
            min = left;
        }
        if (right <= size && p[right].value < p[min].value) {
            min = right;
        }
        if (min != i) {
            Node tmp = p[i];
            p[i] = p[min];
            p[min] = tmp;
            minheapify(min);
        }
    }
    void replacerootnodebylastnode()
    {
        p[1] = p[size];
        size--;
    }
    void replacerootnodebynextelement(Node n)
    {
        p[1] = n;
    }
    ~Heap()
    {
        delete[] p;
    }
private:
    Node *p;
    int size;
    int capacity;
};

// Read the next integer from fp; value == -1 signals that this file is exhausted.
Node getoneelement(FILE* &fp, int file_id)
{
    char content[10];
    Node ret;
    ret.file_id = file_id;
    if (fgets(content, 10, fp) != NULL) {
        ret.value = atoi(content);
    } else {
        ret.value = -1;
    }
    return ret;
}

void writeresult(FILE* &fp, int element)
{
    fprintf(fp, "%d\n", element);
}

void kwaymergeviaheap(FILE **fp)
{
    Heap *pheap = new Heap(K);
    // prime the heap with the first element of every file
    for (int i = 0; i < K; i++) {
        Node t = getoneelement(fp[i], i);
        pheap->insertnode(t);
    }
    pheap->buildminheap();

    char filename_output[] = "E:\\outt.txt";
    FILE *fp_out = fopen(filename_output, "w");
    while (pheap->getheapsize() > 0) {
        Node minnode = pheap->getmin();
        writeresult(fp_out, minnode.value);
        Node t = getoneelement(fp[minnode.file_id], minnode.file_id);
        if (t.value == -1) {   // there are no more elements in fp[minnode.file_id]
            pheap->replacerootnodebylastnode();
        } else {
            pheap->replacerootnodebynextelement(t);
        }
        pheap->minheapify(1);
    }

    for (int i = 0; i < K; i++) {
        fclose(fp[i]);
    }
    fclose(fp_out);
    delete pheap;
}

1. The heap implementation is something I wrote back in university with reference to Introduction to Algorithms, and further optimization is certainly possible. For example, the recursive call at the end of minheapify is a tail call that can be replaced by a loop (a sketch follows below), although the compiler may well do this automatically once optimizations are turned on; for the same reason there is no real need to rewrite expressions like 2*i as shift operations. Another example is the interface of getoneelement: returning a Node by value incurs copy overhead, and returning just an int would be better, but it is a bit more troublesome.
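A sketch (my own, not from the original post) of the sift-down with the tail recursion replaced by a loop, written as a free function over the same layout the Heap class uses: nodes stored in p[1..size], with the children of i at 2*i and 2*i+1. HeapNode simply mirrors the fields of Node above so the snippet stands on its own.

#include <algorithm>   // for std::swap

struct HeapNode { int value; int file_id; };   // same fields as Node in the listing above

void minheapify_iterative(HeapNode *p, int size, int i)
{
    for (;;) {
        int left = 2 * i;
        int right = 2 * i + 1;
        int smallest = i;
        if (left <= size && p[left].value < p[smallest].value)
            smallest = left;
        if (right <= size && p[right].value < p[smallest].value)
            smallest = right;
        if (smallest == i)
            break;                      // heap property restored, done
        std::swap(p[i], p[smallest]);   // sink the node one level down
        i = smallest;                   // continue from the child we swapped into
    }
}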

2. The heap is a very useful data structure. Besides K-way merging, it can also be used for K-way intersection (finding the elements common to K sorted linked lists/files); try it yourself. A rough sketch follows below.
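Since the note above suggests trying it, here is a minimal sketch of my own (the function name kwayintersect and the sample lists are invented for illustration). For simplicity it scans the K cursors linearly to find the minimum and the maximum instead of using a heap; with a large K, the scan for the minimum would be replaced by the same min-heap used for merging.

#include <cstdio>
#include <vector>
#include <algorithm>
using namespace std;

vector<int> kwayintersect(const vector<vector<int> > &lists)
{
    vector<int> result;
    size_t k = lists.size();
    if (k == 0) return result;
    vector<size_t> pos(k, 0);                  // one cursor per sorted list
    for (;;) {
        int minval = 0, maxval = 0;
        for (size_t i = 0; i < k; i++) {
            if (pos[i] >= lists[i].size()) return result;   // any list exhausted -> done
            int v = lists[i][pos[i]];
            if (i == 0) { minval = maxval = v; }
            else { minval = min(minval, v); maxval = max(maxval, v); }
        }
        if (minval == maxval) {                // all cursors agree -> common element
            result.push_back(minval);
            for (size_t i = 0; i < k; i++) pos[i]++;
        } else {
            for (size_t i = 0; i < k; i++)     // advance only the cursors holding the minimum
                if (lists[i][pos[i]] == minval) pos[i]++;
        }
    }
}

int main()
{
    int a[] = {1, 3, 5, 7}, b[] = {3, 5, 8}, c[] = {2, 3, 5, 9};
    vector<vector<int> > lists;
    lists.push_back(vector<int>(a, a + 4));
    lists.push_back(vector<int>(b, b + 3));
    lists.push_back(vector<int>(c, c + 4));
    vector<int> common = kwayintersect(lists);
    for (size_t i = 0; i < common.size(); i++)
        printf("%d\n", common[i]);             // prints 3 and 5
    return 0;
}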

3. For the output, you could actually accumulate a batch of elements in a buffer and flush them to disk in one go. It is a bit of extra trouble, so I did not write it; a possible sketch is shown below.
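One way to get the same effect with little code is to hand stdio a larger buffer via setvbuf, so writes accumulate in memory and hit the disk in big blocks. This is a minimal sketch, not from the original post; the 1 MB buffer size and the output path are assumptions.

#include <cstdio>

int main()
{
    static char buffer[1 << 20];                       // 1 MB user-supplied buffer
    FILE *fp_out = fopen("E:\\out.txt", "w");
    if (fp_out == NULL) return 1;
    setvbuf(fp_out, buffer, _IOFBF, sizeof(buffer));   // full buffering, before the first write
    for (int i = 0; i < 1000000; i++) {
        fprintf(fp_out, "%d\n", i);                    // writes accumulate in the buffer
    }
    fclose(fp_out);                                    // flushes whatever remains
    return 0;
}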

4. Why does the I/O part use C while everything else is C++? Because C++ iostream I/O is too slow, frighteningly slow.
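For what it is worth, if one did want to stay with iostream, the usual mitigation (a well-known technique, not something the original post uses) is to detach iostream from C stdio and untie cin from cout, which removes most of the overhead:

#include <iostream>
using namespace std;

int main()
{
    ios_base::sync_with_stdio(false);   // stop synchronizing iostream with C stdio
    cin.tie(NULL);                      // do not flush cout before every cin read
    int x;
    long long sum = 0;
    while (cin >> x) sum += x;          // reading large inputs is now much faster
    cout << sum << "\n";
    return 0;
}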
