How can we sort a large file of 0.5 billion integers?
Problem
You are given a file, bigdata,
4663 MB in size and containing 0.5 billion numbers. The data in the file is random, one integer per line:
61963023557681612158020393452095006174677379343122016371712330287901712966901...7005375
How can I sort this file now?
Internal sorting
First, try internal (in-memory) sorting. Two sorting methods are tried.
3-way quicksort:
private final int cutoff = 8;

public <T> void perform(Comparable<T>[] a) {
    perform(a, 0, a.length - 1);
}

private <T> int median3(Comparable<T>[] a, int x, int y, int z) {
    if (lessThan(a[x], a[y])) {
        if (lessThan(a[y], a[z])) return y;
        else if (lessThan(a[x], a[z])) return z;
        else return x;
    } else {
        if (lessThan(a[z], a[y])) return y;
        else if (lessThan(a[z], a[x])) return z;
        else return x;
    }
}

private <T> void perform(Comparable<T>[] a, int low, int high) {
    int n = high - low + 1;
    if (n <= cutoff) {
        // When the sequence is very small, sort by insertion
        InsertionSort insertionSort = SortFactory.createInsertionSort();
        insertionSort.perform(a, low, high);
        return;
    } else if (n <= 100) {
        // When the sequence is small, use median-of-3 to pick the pivot
        int m = median3(a, low, low + (n >>> 1), high);
        exchange(a, m, low);
    } else {
        // When the sequence is large, use the ninther
        int gap = n >>> 3;
        int m = low + (n >>> 1);
        int m1 = median3(a, low, low + gap, low + (gap << 1));
        int m2 = median3(a, m - gap, m, m + gap);
        int m3 = median3(a, high - (gap << 1), high - gap, high);
        int ninther = median3(a, m1, m2, m3);
        exchange(a, ninther, low);
    }
    if (high <= low) return;

    int lt = low;                    // lessThan boundary
    int gt = high;                   // greaterThan boundary
    Comparable<T> pivot = a[low];    // partitioning element
    int i = low + 1;
    /*
     * Invariant:
     * a[low .. lt-1]  < pivot  -> front (first)
     * a[lt .. i-1]    = pivot  -> middle
     * a[gt+1 .. high] > pivot  -> rear (final)
     * a[i .. gt]      area still to be examined
     */
    while (i <= gt) {
        if (lessThan(a[i], pivot)) {
            exchange(a, lt++, i++);
        } else if (lessThan(pivot, a[i])) {
            exchange(a, i, gt--);
        } else {
            i++;
        }
    }
    // a[low .. lt-1] < pivot = a[lt .. gt] < a[gt+1 .. high]
    perform(a, low, lt - 1);
    perform(a, gt + 1, high);
}
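(The "ninther" used on large ranges is the median of three median-of-3 samples: a cheap, robust estimate of the true median that guards against consistently bad pivot choices.)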
Merge sort:
/** When the subsequence length is less than or equal to this value, use insertion sort */
private final int cutoff = 8;

/**
 * Sort the given element sequence.
 *
 * @param a the element sequence
 */
@Override
public <T> void perform(Comparable<T>[] a) {
    Comparable<T>[] b = a.clone();
    perform(b, a, 0, a.length - 1);
}

private <T> void perform(Comparable<T>[] src, Comparable<T>[] dest, int low, int high) {
    if (low >= high) return;
    // For small subsequences, fall back to insertion sort
    if (high - low <= cutoff) {
        SortFactory.createInsertionSort().perform(dest, low, high);
        return;
    }
    int mid = low + ((high - low) >>> 1);
    perform(dest, src, low, mid);
    perform(dest, src, mid + 1, high);
    // Exploit partial order: if src[mid] <= src[mid+1], just copy
    if (lessThanOrEqual(src[mid], src[mid + 1])) {
        System.arraycopy(src, low, dest, low, high - low + 1);
        return;
    }
    // src[low .. mid] + src[mid+1 .. high] -> dest[low .. high]
    merge(src, dest, low, mid, high);
}

private <T> void merge(Comparable<T>[] src, Comparable<T>[] dest, int low, int mid, int high) {
    for (int i = low, v = low, w = mid + 1; i <= high; i++) {
        if (w > high || (v <= mid && lessThanOrEqual(src[v], src[w]))) {
            dest[i] = src[v++];
        } else {
            dest[i] = src[w++];
        }
    }
}
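Note the alternating src/dest trick in the recursion: each level swaps the roles of the two arrays, so a merge writes directly into its destination instead of merging in place and copying back, and the arraycopy short-circuit skips merging entirely when the two halves are already in order.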
Too much data, recursion too deep -> stack overflow? Increase -Xss?
Too much data, array too long -> OOM? Increase -Xmx?
There is not enough patience for that. Besides, reading such a large file into memory, maintaining that much data on the heap, and constantly copying it around during the internal sort puts heavy pressure on both stack and heap, and the approach does not generalize.
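To see why -Xmx tuning is a dead end, here is a back-of-the-envelope heap estimate. It is a sketch under assumed figures (64-bit JVM with compressed oops, roughly 16 bytes per Integer object and 4 bytes per reference), not a measurement:

public class HeapEstimate {
    public static void main(String[] args) {
        long n = 500_000_000L;            // 0.5 billion values
        long asIntArray = n * 4L;         // primitive int[]: ~2 GB
        long asBoxed = n * (16L + 4L);    // Integer objects + references: ~10 GB
        System.out.printf("int[]     : %.1f GB%n", asIntArray / 1e9);
        System.out.printf("Integer[] : %.1f GB%n", asBoxed / 1e9);
        // The merge sort above also clones the array, roughly doubling this.
    }
}

Since the sorts above operate on Comparable<T>[], they pay the boxed price, before even counting the merge sort's clone.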
Run the sort command
sort -n bigdata -o bigdata.sorted
How long did it run? 24 minutes.
Why is it so slow?
Let's take a rough look at our resources:
1. Memory
JVM heap/stack, native heap/stack, page cache, block buffer
2. External storage
Swap + Disk
Large amounts of data, many function calls, many system calls, many kernel/user buffer copies, many dirty-page write-backs; io-wait runs very high, the disk stays busy, stack data keeps getting swapped out, and every stage suffers frequent thread switches and lock contention.
In short, memory is too tight and the disk is overworked: persisting the dirty data causes frequent cache invalidations, which produce a flood of write-back requests and busy write-back threads, so a large share of CPU time goes to context switching. Everything is contended, and it is no wonder the run was painful to watch for 24 minutes.
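As an aside, if the machine can spare more memory, GNU coreutils sort has knobs that reduce exactly this temp-file churn. The flags below exist in modern coreutils; the specific sizes are illustrative assumptions, not measured tunings:

sort -n -S 2G --parallel=4 -T /tmp bigdata -o bigdata.sorted

-S raises the in-memory sort buffer, --parallel caps the number of sorting threads, and -T redirects the temporary spill files to a directory of your choice.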
Bitmap Method
private BitSet bits;

public void perform(String largeFileName, int total, String destLargeFileName,
                    Castor<Integer> castor, int readerBufferSize,
                    int writerBufferSize, boolean asc) throws IOException {
    System.out.println("BitmapSort Started.");
    long start = System.currentTimeMillis();
    bits = new BitSet(total);
    InputPart<Integer> largeIn =
            PartFactory.createCharBufferedInputPart(largeFileName, readerBufferSize);
    OutputPart<Integer> largeOut =
            PartFactory.createCharBufferedOutputPart(destLargeFileName, writerBufferSize);
    largeOut.delete();

    Integer data;
    int off = 0;
    try {
        // Pass 1: read every value and set its bit
        while (true) {
            data = largeIn.read();
            if (data == null) break;
            int v = data;
            set(v);
            off++;
        }
        largeIn.close();
        int size = bits.size();
        System.out.println(String.format("lines : %d ,bits : %d", off, size));
        // Pass 2: walk the bitmap in order and write out the set indices
        if (asc) {
            for (int i = 0; i < size; i++) {
                if (get(i)) {
                    largeOut.write(i);
                }
            }
        } else {
            for (int i = size - 1; i >= 0; i--) {
                if (get(i)) {
                    largeOut.write(i);
                }
            }
        }
        largeOut.close();
        long stop = System.currentTimeMillis();
        long elapsed = stop - start;
        System.out.println(String.format("BitmapSort Completed.elapsed : %dms", elapsed));
    } finally {
        largeIn.close();
        largeOut.close();
    }
}

private void set(int i) {
    bits.set(i);
}

private boolean get(int v) {
    return bits.get(v);
}
Nice! It took 190 seconds, about 3 minutes.
It achieved this with a core-memory footprint of only about 4663 MB / 32, with most of the time spent on I/O. Good.
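Rough arithmetic behind that footprint (an estimate, not a figure from the original run): the bitmap spends 1 bit per value where a binary integer would spend 32 bits, hence the 1/32 factor, and 4663 MB / 32 is roughly 146 MB. Note that a BitSet records only presence, so this trick sorts correctly only when the values are distinct; duplicates would collapse into a single bit.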
The problem is: what if a memory stick or two fails at this point, or only a small amount of memory is available?
External sorting
This is where external sorting takes the stage.
How does external sorting work?
1. Split
Maintain a very small in-memory buffer, memBuffer. Read the large file bigdata line by line into memBuffer; whenever memBuffer fills up (or the large file has been read to the end), sort memBuffer internally and write the ordered result to a disk file bigdata.xxx.part.sorted. Then recycle memBuffer and continue, until the whole large file has been processed and n internally sorted disk files are obtained (a sketch of this phase follows below):
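A minimal sketch of the split phase, assuming plain newline-delimited integers; the class and file names here are hypothetical stand-ins for the article's PartFactory-based helpers:

import java.io.*;
import java.util.Arrays;

public class Splitter {
    public static int split(String bigFile, int memBufferSize) throws IOException {
        int[] memBuffer = new int[memBufferSize];
        int part = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(bigFile))) {
            String line;
            int n = 0;
            while ((line = in.readLine()) != null) {
                memBuffer[n++] = Integer.parseInt(line.trim());
                if (n == memBufferSize) {            // buffer full: sort and flush
                    writeSortedPart(bigFile, memBuffer, n, part++);
                    n = 0;                           // recycle memBuffer
                }
            }
            if (n > 0) writeSortedPart(bigFile, memBuffer, n, part++); // tail chunk
        }
        return part;                                 // number of sorted part files
    }

    private static void writeSortedPart(String bigFile, int[] buf, int n, int part)
            throws IOException {
        Arrays.sort(buf, 0, n);                      // internal sort of this chunk
        String name = bigFile + "." + part + ".part.sorted";
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(name)))) {
            for (int i = 0; i < n; i++) out.println(buf[i]);
        }
    }
}

Calling Splitter.split("bigdata", 10_000_000), for example, would produce bigdata.0.part.sorted, bigdata.1.part.sorted, and so on.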
2. Merge
Now we have n small ordered files. How can we combine them into one large ordered file?
Read all the small files back into memory and sort them internally again?
(⊙ O ⊙ )...
No!
Instead, combine them using the core principle of merge sort:
Here is a simple example:
File 1: 3, 6, 9
File 2: 2, 4, 8
File 3: 1, 5, 7
First round:
Minimum value of file 1: 3, sitting in row 1 of file 1.
Minimum value of file 2: 2, sitting in row 1 of file 2.
Minimum value of file 3: 1, sitting in row 1 of file 3.
The minimum of these three minimums: min(3, 2, 1) = 1.
That is, the current minimum of the final large file is the minimum of the current minimums of files 1, 2, and 3.
That minimum is 1, so write 1 to the large file.
Round 2:
Minimum value of file 1: 3, still in row 1 of file 1.
Minimum value of file 2: 2, still in row 1 of file 2.
Minimum value of file 3: 5, now in row 2 of file 3 (row 1 was consumed in the first round).
The minimum of these three minimums: min(3, 2, 5) = 2.
Write 2 to the large file.
In other words, whenever the minimum comes from some file, one row is consumed from that file. (Because each small file is internally ordered, its next row is its new current minimum.) Repeat until every file is exhausted; a sketch of this k-way merge follows below.
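A minimal sketch of the merge phase. The article only states the min-of-the-current-minimums principle; using a PriorityQueue (min-heap) to pick that minimum is one common way to implement it and is my assumption here, as are the class and method names:

import java.io.*;
import java.util.PriorityQueue;

public class Merger {
    private static final class Head {
        final int value;                  // current minimum of this part file
        final BufferedReader reader;      // the part file it came from
        Head(int value, BufferedReader reader) { this.value = value; this.reader = reader; }
    }

    public static void merge(String[] partFiles, String destFile) throws IOException {
        PriorityQueue<Head> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.value, b.value));
        // Seed the heap with the first row (current minimum) of every part file
        for (String part : partFiles) {
            BufferedReader r = new BufferedReader(new FileReader(part));
            String line = r.readLine();
            if (line != null) heap.add(new Head(Integer.parseInt(line.trim()), r));
            else r.close();
        }
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(destFile)))) {
            while (!heap.isEmpty()) {
                Head min = heap.poll();               // minimum of the current minimums
                out.println(min.value);
                String line = min.reader.readLine();  // consume one row from that file
                if (line != null) {
                    heap.add(new Head(Integer.parseInt(line.trim()), min.reader));
                } else {
                    min.reader.close();               // this part file is exhausted
                }
            }
        }
    }
}

With a heap, each output row costs O(log n) for n part files, rather than a scan of all n current minimums on every round.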
The final time was 771 seconds, about 13 minutes.
less bigdata.sorted.text
...
9999966
9999967
9999968
9999969
9999970
9999971
9999972
9999973
9999974
9999975
9999976
9999977
9999978
...