A large file with 500 million integers: how do you sort it?

Source: Internet
Author: User
Tags: bitset, comparable

Problem

You are given a file, bigdata, 4663 MB in size, containing 500 million integers. The data in the file is random, one integer per line:

61963023557681612158020393452095006174677379343122016371712330287901712966901...7005375

Now you want to sort this file. How do you go about it?

Internal sorting

Let's try internal sorting first. Two algorithms to choose from:

3-way quicksort:
private final int CUTOFF = 8;

public <T> void perform(Comparable<T>[] a) {
    perform(a, 0, a.length - 1);
}

// lessThan / exchange are helpers from the article's sort framework (not shown)
private <T> int median3(Comparable<T>[] a, int x, int y, int z) {
    if (lessThan(a[x], a[y])) {
        if (lessThan(a[y], a[z])) {
            return y;
        } else if (lessThan(a[x], a[z])) {
            return z;
        } else {
            return x;
        }
    } else {
        if (lessThan(a[z], a[y])) {
            return y;
        } else if (lessThan(a[z], a[x])) {
            return z;
        } else {
            return x;
        }
    }
}

private <T> void perform(Comparable<T>[] a, int low, int high) {
    int n = high - low + 1;
    if (n <= CUTOFF) {
        // very small range: hand off to insertion sort
        InsertionSort insertionSort = SortFactory.createInsertionSort();
        insertionSort.perform(a, low, high);
        return;
    } else if (n <= 40) { // threshold garbled in the source; 40 is the classic Bentley-McIlroy cutoff
        // smallish range: median-of-3 pivot
        int m = median3(a, low, low + (n >>> 1), high);
        exchange(a, m, low);
    } else {
        // large range: "ninther" (median of three medians) pivot
        int gap = n >>> 3;
        int m = low + (n >>> 1);
        int m1 = median3(a, low, low + gap, low + (gap << 1));
        int m2 = median3(a, m - gap, m, m + gap);
        int m3 = median3(a, high - (gap << 1), high - gap, high);
        int ninther = median3(a, m1, m2, m3);
        exchange(a, ninther, low);
    }
    if (high <= low) return;

    int lt = low;                  // less-than boundary
    int gt = high;                 // greater-than boundary
    Comparable<T> pivot = a[low];  // partitioning element
    int i = low + 1;
    /*
     * Invariant:
     *   a[low .. lt-1]  < pivot   (front)
     *   a[lt  .. i-1 ] == pivot   (middle)
     *   a[gt+1 .. high] > pivot   (rear)
     *   a[i .. gt] is the region still to be examined
     */
    while (i <= gt) {
        if (lessThan(a[i], pivot)) {
            exchange(a, lt++, i++);
        } else if (lessThan(pivot, a[i])) {
            exchange(a, i, gt--);
        } else {
            i++;
        }
    }
    // a[low .. lt-1] < pivot = a[lt .. gt] < a[gt+1 .. high]
    perform(a, low, lt - 1);
    perform(a, gt + 1, high);
}
Merge Sort:
/**
 * Ranges at or below this size are handed to insertion sort.
 */
private final int CUTOFF = 8;

/**
 * Sorts the given sequence of elements.
 *
 * @param a the sequence to sort
 */
@Override
public <T> void perform(Comparable<T>[] a) {
    Comparable<T>[] b = a.clone();
    perform(b, a, 0, a.length - 1);
}

private <T> void perform(Comparable<T>[] src, Comparable<T>[] dest, int low, int high) {
    if (low >= high) return;

    // small range: hand off to insertion sort
    if (high - low <= CUTOFF) {
        SortFactory.createInsertionSort().perform(dest, low, high);
        return;
    }

    int mid = low + ((high - low) >>> 1);
    perform(dest, src, low, mid);
    perform(dest, src, mid + 1, high);

    // exploit local order: if src[mid] <= src[mid+1], the two halves are already in order
    if (lessThanOrEqual(src[mid], src[mid + 1])) {
        System.arraycopy(src, low, dest, low, high - low + 1);
        return;
    }

    // merge src[low .. mid] and src[mid+1 .. high] into dest[low .. high]
    merge(src, dest, low, mid, high);
}

private <T> void merge(Comparable<T>[] src, Comparable<T>[] dest, int low, int mid, int high) {
    for (int i = low, v = low, w = mid + 1; i <= high; i++) {
        if (w > high || (v <= mid && lessThanOrEqual(src[v], src[w]))) {
            dest[i] = src[v++];
        } else {
            dest[i] = src[w++];
        }
    }
}

Too much data, recursion too deep: StackOverflowError? Increase -Xss?
Too much data, the array too long: OutOfMemoryError? Increase -Xmx?
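(For reference: -Xss and -Xmx are the standard JVM flags for thread-stack size and maximum heap. The class name and sizes below are purely illustrative.)

java -Xss64m -Xmx6g BigDataSorter bigdata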

I ran out of patience before it finished. Reading such a large file into memory, holding that much data on the heap, and all the copying an internal sort does put heavy pressure on both stack and heap. This approach doesn't generalize.

Run the sort command instead:

sort -n bigdata -o bigdata.sorted

How long did it run? 24 minutes.

Why is it so slow?

Take a quick look at our resources:
1. Memory
   JVM heap/stack, native heap/stack, page cache, block buffers
2. External storage
   Swap + disk

The data volume is huge: many function calls, many system calls, many kernel/user-space buffer copies, many dirty-page writes. io-wait is very high, the disks are busy, stack data keeps getting swapped out, threads context-switch constantly, and every stage contends on locks.

In short: memory is tight, so we borrow space from the disk; too much long-lived dirty data makes the page cache thrash and triggers heavy write-back; write-back threads run hot, and CPU time is burned on context switches. Everything is bad. So the 24 minutes is no surprise, and it is unbearable.

Bitmap Method

The idea: every value is a bounded, non-negative integer, so the value itself can serve as a bit index. Set bit v for each v read from the file, then scan the bits in ascending order and write out every index that is set.
private BitSet bits;

public void perform(String largeFileName, int total, String destLargeFileName,
                    Castor<Integer> castor, int readerBufferSize,
                    int writerBufferSize, boolean asc) throws IOException {
    System.out.println("BitmapSort started.");
    long start = System.currentTimeMillis();
    bits = new BitSet(total);
    InputPart<Integer> largeIn =
            PartFactory.createCharBufferedInputPart(largeFileName, readerBufferSize);
    OutputPart<Integer> largeOut =
            PartFactory.createCharBufferedOutputPart(destLargeFileName, writerBufferSize);
    largeOut.delete();

    Integer data;
    int off = 0;
    try {
        // pass 1: set one bit per value read from the large file
        while (true) {
            data = largeIn.read();
            if (data == null) break;
            int v = data;
            set(v);
            off++;
        }
        largeIn.close();

        int size = bits.size();
        System.out.println(String.format("lines: %d, bits: %d", off, size));

        // pass 2: scan the bits in order and write out the set indices
        if (asc) {
            for (int i = 0; i < size; i++) {
                if (get(i)) {
                    largeOut.write(i);
                }
            }
        } else {
            for (int i = size - 1; i >= 0; i--) {
                if (get(i)) {
                    largeOut.write(i);
                }
            }
        }
        largeOut.close();

        long stop = System.currentTimeMillis();
        long elapsed = stop - start;
        System.out.println(String.format("BitmapSort completed. elapsed: %dms", elapsed));
    } finally {
        largeIn.close();
        largeOut.close();
    }
}

private void set(int i) {
    bits.set(i);
}

private boolean get(int v) {
    return bits.get(v);
}
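The code above depends on the article's own helper classes (SortFactory, InputPart, PartFactory, Castor), which are not shown. A minimal, self-contained sketch of the same idea using only the JDK might look like the following; it assumes every value is a distinct non-negative int (a plain BitSet records presence only, so duplicates would be collapsed), and the class and method names are illustrative:

import java.io.*;
import java.util.BitSet;

public class BitmapSortSketch {

    // Sorts a file of distinct non-negative integers, one per line:
    // set one bit per value, then scan the bits in ascending order.
    public static void sort(String inFile, String outFile, int maxValue) throws IOException {
        BitSet bits = new BitSet(maxValue + 1);
        try (BufferedReader in = new BufferedReader(new FileReader(inFile))) {
            String line;
            while ((line = in.readLine()) != null) {
                bits.set(Integer.parseInt(line.trim()));
            }
        }
        try (BufferedWriter out = new BufferedWriter(new FileWriter(outFile))) {
            // nextSetBit walks the set bits in ascending order; -1 means no more
            for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
                out.write(Integer.toString(i));
                out.newLine();
                if (i == Integer.MAX_VALUE) break; // avoid overflow of i + 1
            }
        }
    }
}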

Nice! It ran in 190 seconds, about 3 minutes.
It produced this result in roughly 4663M/32 of core memory (the bitmap needs only a bit per value where the data file spends bytes of text), and most of the time went to I/O. Good.

The problem: what if a memory stick or two suddenly fails, or only a tiny amount of memory is available? What then?

External Sort

Time for external sorting to take the stage.
What is an external sort?

  1. When memory is very limited, it uses external storage to hold intermediate results, then sorts with a multi-way merge;
  2. It is the direct ancestor of map-reduce.


1. Split

In memory, maintain a very small core buffer, memBuffer. Read the large file bigdata line by line; whenever memBuffer fills up (or the large file has been fully read), sort the data in memBuffer with an internal sort and write the sorted result to a disk file bigdata.xxx.part.sorted.
Reuse memBuffer until the whole large file has been processed, which yields N sorted disk files.
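A sketch of this split phase, under the same one-integer-per-line assumption (buffer size and the part-file naming are illustrative):

import java.io.*;
import java.util.Arrays;

public class Splitter {

    // Reads the input line by line; every time the buffer fills,
    // sorts it in memory and flushes it to <inFile>.<k>.part.sorted.
    public static int split(String inFile, int bufferSize) throws IOException {
        int[] memBuffer = new int[bufferSize];
        int count = 0, part = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(inFile))) {
            String line;
            while ((line = in.readLine()) != null) {
                memBuffer[count++] = Integer.parseInt(line.trim());
                if (count == bufferSize) {
                    flush(memBuffer, count, inFile, part++);
                    count = 0;
                }
            }
            if (count > 0) {
                flush(memBuffer, count, inFile, part++);
            }
        }
        return part; // number of sorted part files produced
    }

    private static void flush(int[] buf, int n, String inFile, int part) throws IOException {
        Arrays.sort(buf, 0, n); // the internal sort of the core buffer
        try (BufferedWriter out = new BufferedWriter(
                new FileWriter(inFile + "." + part + ".part.sorted"))) {
            for (int i = 0; i < n; i++) {
                out.write(Integer.toString(buf[i]));
                out.newLine();
            }
        }
    }
}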

2. Merge

Now that you have N sorted small files, how do you merge them into one large, sorted file?
Read all the small files into memory and then run an internal sort?
(⊙o⊙) ...
No!

Instead, merge them using the following principle:

Let's give a simple example:

File 1: 3, 6, 9
File 2: 2, 4, 8
File 3: 1, 5, 7

First round:
Minimum value of file 1: 3 (line 1 of file 1)
Minimum value of file 2: 2 (line 1 of file 2)
Minimum value of file 3: 1 (line 1 of file 3)
So the minimum value across these 3 files is min(3, 2, 1) = 1.
In other words, the current minimum of the final large file is the minimum of the current minimums of files 1, 2, and 3, right?
We take that minimum value, 1, and write it to the large file.

Second round:
Minimum value of file 1: 3 (line 1 of file 1)
Minimum value of file 2: 2 (line 1 of file 2)
Minimum value of file 3: 5 (line 2 of file 3)
So the minimum value across these 3 files is min(3, 2, 5) = 2.
We write 2 to the large file.

That is, whichever file the current minimum came from, we consume one line from that file. (Because each small file is internally sorted, its next line is its current minimum.)
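In code, "take the minimum of the current minimums" is exactly what a min-heap gives us. A sketch of the merge phase using the JDK's PriorityQueue (part-file names follow the split sketch above and are illustrative):

import java.io.*;
import java.util.PriorityQueue;

public class Merger {

    // One entry per part file: its current line (value) and its reader.
    private static class Entry implements Comparable<Entry> {
        final int value;
        final BufferedReader reader;
        Entry(int value, BufferedReader reader) { this.value = value; this.reader = reader; }
        public int compareTo(Entry o) { return Integer.compare(value, o.value); }
    }

    public static void merge(String inFile, int parts, String outFile) throws IOException {
        // seed the heap with the first (smallest) line of every part file
        PriorityQueue<Entry> heap = new PriorityQueue<>();
        for (int p = 0; p < parts; p++) {
            BufferedReader r = new BufferedReader(
                    new FileReader(inFile + "." + p + ".part.sorted"));
            String line = r.readLine();
            if (line != null) heap.add(new Entry(Integer.parseInt(line.trim()), r));
            else r.close();
        }
        try (BufferedWriter out = new BufferedWriter(new FileWriter(outFile))) {
            while (!heap.isEmpty()) {
                Entry e = heap.poll();             // global minimum across all part files
                out.write(Integer.toString(e.value));
                out.newLine();
                String line = e.reader.readLine(); // advance only the file the minimum came from
                if (line != null) heap.add(new Entry(Integer.parseInt(line.trim()), e.reader));
                else e.reader.close();
            }
        }
    }
}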

Final timing: it ran in 771 seconds, about 13 minutes.

less bigdata.sorted.text
...
9999966
9999967
9999968
9999969
9999970
9999971
9999972
9999973
9999974
9999975
9999976
9999977
9999978
...
