How can we sort a large file of 0.5 billion integers?
Problem
You are given a file, bigdata,
4663 MB in size and containing 0.5 billion numbers. The data in the file is random, one integer per line:
61963023557681612158020393452095006174677379343122016371712330287901712966901...7005375
How can I sort this file now?
Internal sorting
First, try internal (in-memory) sorting. Two sorting methods are tried.
3-way quicksort:
private final int cutoff = 8;

public <T> void perform(Comparable<T>[] a) {
    perform(a, 0, a.length - 1);
}

private <T> int median3(Comparable<T>[] a, int x, int y, int z) {
    if (lessThan(a[x], a[y])) {
        if (lessThan(a[y], a[z])) return y;
        else if (lessThan(a[x], a[z])) return z;
        else return x;
    } else {
        if (lessThan(a[z], a[y])) return y;
        else if (lessThan(a[z], a[x])) return z;
        else return x;
    }
}

private <T> void perform(Comparable<T>[] a, int low, int high) {
    int n = high - low + 1;
    if (n <= cutoff) {
        // When the sequence is very small, sort by insertion
        InsertionSort insertionSort = SortFactory.createInsertionSort();
        insertionSort.perform(a, low, high);
        return;
    } else if (n <= 100) {
        // When the sequence is small, use median-of-3 to pick the pivot
        int m = median3(a, low, low + (n >>> 1), high);
        exchange(a, m, low);
    } else {
        // When the sequence is large, use the ninther
        int gap = n >>> 3;
        int m = low + (n >>> 1);
        int m1 = median3(a, low, low + gap, low + (gap << 1));
        int m2 = median3(a, m - gap, m, m + gap);
        int m3 = median3(a, high - (gap << 1), high - gap, high);
        int ninther = median3(a, m1, m2, m3);
        exchange(a, ninther, low);
    }
    if (high <= low) return;

    int lt = low;                    // lessThan boundary
    int gt = high;                   // greaterThan boundary
    Comparable<T> pivot = a[low];    // partitioning element
    int i = low + 1;
    /*
     * Invariant:
     * a[low .. lt-1]  < pivot  -> front (first)
     * a[lt .. i-1]    = pivot  -> middle
     * a[gt+1 .. high] > pivot  -> rear (final)
     * a[i .. gt]      area still to be examined
     */
    while (i <= gt) {
        if (lessThan(a[i], pivot)) {
            exchange(a, lt++, i++);
        } else if (lessThan(pivot, a[i])) {
            exchange(a, i, gt--);
        } else {
            i++;
        }
    }
    // a[low .. lt-1] < pivot = a[lt .. gt] < a[gt+1 .. high]
    perform(a, low, lt - 1);
    perform(a, gt + 1, high);
}
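(The "ninther" used on large ranges is the median of three median-of-3 samples: a cheap, robust estimate of the true median that guards against consistently bad pivot choices.)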
Merge sort:
/** When the subsequence length is less than or equal to this value, use insertion sort */
private final int cutoff = 8;

/**
 * Sort the given element sequence.
 *
 * @param a the element sequence
 */
@Override
public <T> void perform(Comparable<T>[] a) {
    Comparable<T>[] b = a.clone();
    perform(b, a, 0, a.length - 1);
}

private <T> void perform(Comparable<T>[] src, Comparable<T>[] dest, int low, int high) {
    if (low >= high) return;
    // For small subsequences, fall back to insertion sort
    if (high - low <= cutoff) {
        SortFactory.createInsertionSort().perform(dest, low, high);
        return;
    }
    int mid = low + ((high - low) >>> 1);
    perform(dest, src, low, mid);
    perform(dest, src, mid + 1, high);
    // Exploit partial order: if src[mid] <= src[mid+1], just copy
    if (lessThanOrEqual(src[mid], src[mid + 1])) {
        System.arraycopy(src, low, dest, low, high - low + 1);
        return;
    }
    // src[low .. mid] + src[mid+1 .. high] -> dest[low .. high]
    merge(src, dest, low, mid, high);
}

private <T> void merge(Comparable<T>[] src, Comparable<T>[] dest, int low, int mid, int high) {
    for (int i = low, v = low, w = mid + 1; i <= high; i++) {
        if (w > high || (v <= mid && lessThanOrEqual(src[v], src[w]))) {
            dest[i] = src[v++];
        } else {
            dest[i] = src[w++];
        }
    }
}
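Note the alternating src/dest trick in the recursion: each level swaps the roles of the two arrays, so a merge writes directly into its destination instead of merging in place and copying back, and the arraycopy short-circuit skips merging entirely when the two halves are already in order.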
Too much data, recursion too deep -> stack overflow? Increase -Xss?
Too much data, array too long -> OOM? Increase -Xmx?
There is not enough patience for that. Besides, reading such a large file into memory, maintaining that much data on the heap, and constantly copying it around during the internal sort puts heavy pressure on both stack and heap, and the approach does not generalize.
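To see why -Xmx tuning is a dead end, here is a back-of-the-envelope heap estimate. It is a sketch under assumed figures (64-bit JVM with compressed oops, roughly 16 bytes per Integer object and 4 bytes per reference), not a measurement:

public class HeapEstimate {
    public static void main(String[] args) {
        long n = 500_000_000L;            // 0.5 billion values
        long asIntArray = n * 4L;         // primitive int[]: ~2 GB
        long asBoxed = n * (16L + 4L);    // Integer objects + references: ~10 GB
        System.out.printf("int[]     : %.1f GB%n", asIntArray / 1e9);
        System.out.printf("Integer[] : %.1f GB%n", asBoxed / 1e9);
        // The merge sort above also clones the array, roughly doubling this.
    }
}

Since the sorts above operate on Comparable<T>[], they pay the boxed price, before even counting the merge sort's clone.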
Run the sort command
sort -n bigdata -o bigdata.sorted
How long did it run? 24 minutes.
Why is it so slow?
Let's take a rough look at our resources:
1. Memory
JVM heap/stack, native heap/stack, page cache, block buffer
2. External storage
Swap + Disk
Large amounts of data, many function calls, many system calls, many kernel/user buffer copies, many dirty-page write-backs; io-wait runs very high, the disk stays busy, stack data keeps getting swapped out, and every stage suffers frequent thread switches and lock contention.
In short, memory is too tight and the disk is overworked: persisting the dirty data causes frequent cache invalidations, which produce a flood of write-back requests and busy write-back threads, so a large share of CPU time goes to context switching. Everything is contended, and it is no wonder the run was painful to watch for 24 minutes.
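As an aside, if the machine can spare more memory, GNU coreutils sort has knobs that reduce exactly this temp-file churn. The flags below exist in modern coreutils; the specific sizes are illustrative assumptions, not measured tunings:

sort -n -S 2G --parallel=4 -T /tmp bigdata -o bigdata.sorted

-S raises the in-memory sort buffer, --parallel caps the number of sorting threads, and -T redirects the temporary spill files to a directory of your choice.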
Bitmap Method
private BitSet bits;

public void perform(String largeFileName, int total, String destLargeFileName,
                    Castor<Integer> castor, int readerBufferSize,
                    int writerBufferSize, boolean asc) throws IOException {
    System.out.println("BitmapSort Started.");
    long start = System.currentTimeMillis();
    bits = new BitSet(total);
    InputPart<Integer> largeIn =
            PartFactory.createCharBufferedInputPart(largeFileName, readerBufferSize);
    OutputPart<Integer> largeOut =
            PartFactory.createCharBufferedOutputPart(destLargeFileName, writerBufferSize);
    largeOut.delete();

    Integer data;
    int off = 0;
    try {
        // Pass 1: read every value and set its bit
        while (true) {
            data = largeIn.read();
            if (data == null) break;
            int v = data;
            set(v);
            off++;
        }
        largeIn.close();
        int size = bits.size();
        System.out.println(String.format("lines : %d ,bits : %d", off, size));
        // Pass 2: walk the bitmap in order and write out the set indices
        if (asc) {
            for (int i = 0; i < size; i++) {
                if (get(i)) {
                    largeOut.write(i);
                }
            }
        } else {
            for (int i = size - 1; i >= 0; i--) {
                if (get(i)) {
                    largeOut.write(i);
                }
            }
        }
        largeOut.close();
        long stop = System.currentTimeMillis();
        long elapsed = stop - start;
        System.out.println(String.format("BitmapSort Completed.elapsed : %dms", elapsed));
    } finally {
        largeIn.close();
        largeOut.close();
    }
}

private void set(int i) {
    bits.set(i);
}

private boolean get(int v) {
    return bits.get(v);
}
Nice! It took 190 seconds, about 3 minutes.
It achieved this with a core-memory footprint of only about 4663 MB / 32, with most of the time spent on I/O. Good.
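Rough arithmetic behind that footprint (an estimate, not a figure from the original run): the bitmap spends 1 bit per value where a binary integer would spend 32 bits, hence the 1/32 factor, and 4663 MB / 32 is roughly 146 MB. Note that a BitSet records only presence, so this trick sorts correctly only when the values are distinct; duplicates would collapse into a single bit.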
The problem is: what if a memory stick or two fails at this point, or only a small amount of memory is available?
External sorting
This is where external sorting takes the stage.
How does external sorting work?
1. Split
Maintain a very small in-memory buffer, memBuffer. Read the large file bigdata line by line into memBuffer; whenever memBuffer fills up (or the large file has been read to the end), sort memBuffer internally and write the ordered result to a disk file bigdata.xxx.part.sorted. Then recycle memBuffer and continue, until the whole large file has been processed and n internally sorted disk files are obtained (a sketch of this phase follows below):
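A minimal sketch of the split phase, assuming plain newline-delimited integers; the class and file names here are hypothetical stand-ins for the article's PartFactory-based helpers:

import java.io.*;
import java.util.Arrays;

public class Splitter {
    public static int split(String bigFile, int memBufferSize) throws IOException {
        int[] memBuffer = new int[memBufferSize];
        int part = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(bigFile))) {
            String line;
            int n = 0;
            while ((line = in.readLine()) != null) {
                memBuffer[n++] = Integer.parseInt(line.trim());
                if (n == memBufferSize) {            // buffer full: sort and flush
                    writeSortedPart(bigFile, memBuffer, n, part++);
                    n = 0;                           // recycle memBuffer
                }
            }
            if (n > 0) writeSortedPart(bigFile, memBuffer, n, part++); // tail chunk
        }
        return part;                                 // number of sorted part files
    }

    private static void writeSortedPart(String bigFile, int[] buf, int n, int part)
            throws IOException {
        Arrays.sort(buf, 0, n);                      // internal sort of this chunk
        String name = bigFile + "." + part + ".part.sorted";
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(name)))) {
            for (int i = 0; i < n; i++) out.println(buf[i]);
        }
    }
}

Calling Splitter.split("bigdata", 10_000_000), for example, would produce bigdata.0.part.sorted, bigdata.1.part.sorted, and so on.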
2. Merge
Now we have n small ordered files. How can we combine them into one large ordered file?
Read all the small files back into memory and sort them internally again?
(⊙ O ⊙ )...
No!
Instead, combine them using the core principle of merge sort:
Here is a simple example:
File 1: 3, 6, 9
File 2: 2, 4, 8
File 3: 1, 5, 7
First round:
Minimum value of file 1: 3, sitting in row 1 of file 1.
Minimum value of file 2: 2, sitting in row 1 of file 2.
Minimum value of file 3: 1, sitting in row 1 of file 3.
The minimum of these three minimums: min(3, 2, 1) = 1.
That is, the current minimum of the final large file is the minimum of the current minimums of files 1, 2, and 3.
That minimum is 1, so write 1 to the large file.
Round 2:
Minimum value of file 1: 3, still in row 1 of file 1.
Minimum value of file 2: 2, still in row 1 of file 2.
Minimum value of file 3: 5, now in row 2 of file 3 (row 1 was consumed in the first round).
The minimum of these three minimums: min(3, 2, 5) = 2.
Write 2 to the large file.
In other words, whenever the minimum comes from some file, one row is consumed from that file. (Because each small file is internally ordered, its next row is its new current minimum.) Repeat until every file is exhausted; a sketch of this k-way merge follows below.
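A minimal sketch of the merge phase. The article only states the min-of-the-current-minimums principle; using a PriorityQueue (min-heap) to pick that minimum is one common way to implement it and is my assumption here, as are the class and method names:

import java.io.*;
import java.util.PriorityQueue;

public class Merger {
    private static final class Head {
        final int value;                  // current minimum of this part file
        final BufferedReader reader;      // the part file it came from
        Head(int value, BufferedReader reader) { this.value = value; this.reader = reader; }
    }

    public static void merge(String[] partFiles, String destFile) throws IOException {
        PriorityQueue<Head> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.value, b.value));
        // Seed the heap with the first row (current minimum) of every part file
        for (String part : partFiles) {
            BufferedReader r = new BufferedReader(new FileReader(part));
            String line = r.readLine();
            if (line != null) heap.add(new Head(Integer.parseInt(line.trim()), r));
            else r.close();
        }
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(destFile)))) {
            while (!heap.isEmpty()) {
                Head min = heap.poll();               // minimum of the current minimums
                out.println(min.value);
                String line = min.reader.readLine();  // consume one row from that file
                if (line != null) {
                    heap.add(new Head(Integer.parseInt(line.trim()), min.reader));
                } else {
                    min.reader.close();               // this part file is exhausted
                }
            }
        }
    }
}

With a heap, each output row costs O(log n) for n part files, rather than a scan of all n current minimums on every round.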
The final time was 771 seconds, about 13 minutes.
less bigdata.sorted.text
...
9999966
9999967
9999968
9999969
9999970
9999971
9999972
9999973
9999974
9999975
9999976
9999977
9999978
...