External sorting: killing a chicken with a sledgehammer?


Previous article: http://blog.csdn.net/gsky1986/article/details/46499529

    • Character sets and encodings
    • Byte order
    • I/O mode
    • Memory
    • Disk
    • Thread/synchronous/asynchronous
    • Data features

Character sets and encodings

Why should I care about a file's encoding?
If you ship a document from Arabia to a friend in China and ask them to run an external sort on it, your friend may well be dumbfounded:

What on earth is this? Garbled characters.
You can also produce garbled characters yourself, like this:

echo"数"-f UTF-8 -t UNICODE t.txt??pe

OK, you get the idea: if you do not know a file's encoding, you may well parse it into garbled characters.

What is a character set?

A charset (char-set) is a collection of characters, such as Unicode or ASCII.

What is an encoding?

An encoding is a way of representing characters as bytes, such as UTF-8 or ASCII.

How are character sets and encodings related?

Confused? So was I. How can ASCII be both a character set and an encoding?

Historically the two terms have been used as synonyms, but they are not exactly the same thing, and there is no single normative definition. How should you think about it?

A character set emphasizes the range of characters it "supports": characters inside the set are covered, characters outside it are not. The set has a boundary; within the boundary every character has a representation, and outside the boundary there is nothing to represent.

An encoding emphasizes how the characters of some character set get translated into something the machine understands, i.e. binary. If, for a given character set, my conversion rule coincides with the set's own representation, then I am both the encoding and the character set; otherwise I am merely a conversion format for that character set.

So: Unicode is a character set, and for every character it supports it defines a representation (a code point). Is that representation an encoding? No doubt about it, it is.

UTF-8 is an encoding; it is a variable-length implementation of Unicode. What is the relationship between this encoding and the Unicode encoding? A conversion relationship.

So whether you call something "the encoding" or "the character set" often depends on context.

To give a simple example:

The Chinese character "数" ("number"):

    Encoding / character set   Base   Value
    Unicode                    16     6570
    Unicode                    2      110010101110000
    UTF-8                      16     E6 95 B0
    UTF-8                      2      111001101001010110110000

You can verify this for yourself:

    // See UTF-8
    echo "数" > t.txt
    hexdump -xv -C t.txt
    0000000    95e6    0ab0
    00000000  e6 95 b0 0a                                       |....|
    00000004

    // See Unicode
    iconv -f utf-8 -t unicode t.txt | hexdump -xv -C
    0000000    feff    6570    000a
    00000000  ff fe 70 65 0a 00                                 |..pe..|
    00000006
Byte order

If you noticed, the character "数" ("number"), once UTF-8 encoded, is E6 95 B0 in hexadecimal, and after appending a newline it becomes E6 95 B0 0A. Now look at it this way:

    hexdump -xv -C t.txt
    0000000    95e6    0ab0
    00000000  e6 95 b0 0a                                       |....|
    00000004

Parsed as double-byte (16-bit) units, what you get is 95 E6 0A B0. You may find that puzzling, and the root cause of the puzzle is byte order.

Byte order is simply the order of bytes. If E6 95 B0 0A is parsed as two-byte (16-bit) units: read 16 bits, get E6 95, reverse the byte order, get 95 E6; read the next 16 bits, get B0 0A, reverse the byte order, get 0A B0; put together, that is 95 E6 0A B0.

This "reversed byte order" is called Little-Endian, the little end or little tail.

Big-endian (Big-Endian | BE)


[Diagram of big-endian byte layout omitted; image from Wikipedia.]

As the diagram shows, the highest byte of the number is placed at the low address of memory:
0A goes to a + 0. This is big-endian (Big-Endian). The highest byte of a number is called the Most Significant Byte, which you can read as the "most effective" or "most important" byte. Why most important?

For a number, the highest byte reflects its sign (positive or negative), and it says more about the number's magnitude than any lower byte does.

This property of big-endian storage is also its advantage: to roughly estimate a number's magnitude and sign, you only need to fetch the highest byte.

Little-endian (Little-Endian | LE)

Conversely, if the lowest byte of the number is placed at the low address of memory:
0D goes to a + 0. This is little-endian (Little-Endian). The lowest byte of the number is called the Least Significant Byte, the "least significant" or "least important" byte.

Two examples, one 32-bit and one 16-bit value, stored at a base address a:

    Number (decimal)   Number (hex)   Bits/bytes   BE            LE
    168496141          0A 0B 0C 0D    32/4         0A 0B 0C 0D   0D 0C 0B 0A
    3085               0C 0D          16/2         0C 0D         0D 0C

Clearly, big-endian storage matches the habit of writing and reading from left to right, so why go to the trouble of little-endian storage at all?

With little-endian storage, fetching the number can work like this:
1st byte at a + 0
2nd byte at a + 1
3rd byte at a + 2
4th byte at a + 3
One base address a plus different offsets gives you fetches of different widths, always starting from the least significant byte. Clever, isn't it?
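To see that in action, here is a minimal sketch of my own (not from the original article), using the same sun.misc.Unsafe trick as the detection code further below; on a little-endian machine, reads of different widths from the same base address all start at the least significant byte:

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class LittleEndianReads {
        public static void main(String[] args) throws Exception {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);

            long a = unsafe.allocateMemory(8);        // base address a
            try {
                unsafe.putLong(a, 0x0A0B0C0DL);        // store 0x000000000A0B0C0D
                // On a little-endian machine the low-order bytes sit at the low addresses,
                // so narrower reads from the same base address return the low-order part:
                System.out.printf("byte : 0x%02X%n", unsafe.getByte(a));   // 0x0D
                System.out.printf("short: 0x%04X%n", unsafe.getShort(a));  // 0x0C0D
                System.out.printf("int  : 0x%08X%n", unsafe.getInt(a));    // 0x0A0B0C0D
                System.out.printf("long : 0x%016X%n", unsafe.getLong(a));  // 0x000000000A0B0C0D
            } finally {
                unsafe.freeMemory(a);
            }
        }
    }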

How to tell big-endian from little-endian?

Did you notice this output earlier?

    // See Unicode
    iconv -f utf-8 -t unicode t.txt | hexdump -xv -C
    0000000    feff    6570    000a
    00000000  ff fe 70 65 0a 00                                 |..pe..|
    00000006

The file actually starts with the 2 bytes FF FE (which hexdump's 16-bit column assembles into the unit feff). These 2 bytes are the Byte Order Mark (BOM): the BOM character is U+FEFF, so a file starting with FF FE is stored little-endian, and one starting with FE FF is stored big-endian. The BOM can therefore be used both to distinguish big-endian from little-endian storage and to help identify the encoding.

As for telling big-endian from little-endian programmatically, if you remember what byte order (字节序) means, the code is not hard to write:

    // Java: get the native byte order (BE/LE)
    // needs: java.lang.reflect.Field, java.nio.ByteOrder, sun.misc.Unsafe
    Field us = Unsafe.class.getDeclaredField("theUnsafe");
    us.setAccessible(true);
    Unsafe unsafe = (Unsafe) us.get(null);
    long a = unsafe.allocateMemory(8);
    ByteOrder byteOrder = null;
    try {
        unsafe.putLong(a, 0x0102030405060708L);      // write a known 8-byte pattern
        byte b = unsafe.getByte(a);                  // read the byte at the lowest address
        switch (b) {
            case 0x01:
                byteOrder = ByteOrder.BIG_ENDIAN;    // highest byte at the lowest address
                break;
            case 0x08:
                byteOrder = ByteOrder.LITTLE_ENDIAN; // lowest byte at the lowest address
                break;
            default:
                assert false;
                byteOrder = null;
        }
    } finally {
        unsafe.freeMemory(a);
    }
    System.out.println(byteOrder.toString());

It is even easier in C/C++: a single union does the job.
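And in Java, if you would rather not touch Unsafe at all, the standard library will simply tell you the native order:

    import java.nio.ByteOrder;

    public class NativeOrder {
        public static void main(String[] args) {
            // Prints BIG_ENDIAN or LITTLE_ENDIAN depending on the machine
            System.out.println(ByteOrder.nativeOrder());
        }
    }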

What does byte order affect?

If you get the character set or encoding wrong, you get garbled text; if you get the byte order wrong, you get wrong numbers. Obviously 0C0D and 0D0C are not the same number, are they?

I/O mode

Since we are sorting files, file I/O is unavoidable. Which of Java's classic I/O styles are worth a look?

Reading and writing files with character-buffered streams

The classic representative is BufferedReader/Writer. Roughly speaking, this style is only suitable for character (text) files, even though every file is ultimately just 0s and 1s.

When you read a line from a BufferedReader and its character buffer runs dry, the request travels down the chain:

BufferedReader -> StreamDecoder -> FileInputStream -> JNI/native -> read() in C ...

Moreover, at the read()/C level data may be fetched in small pieces, in the worst case a byte at a time. The operating system does its best in the block buffer / page cache, prefetching more physical blocks to reduce disk I/O, but you still pay for the system calls, for the data copies between kernel and user buffers, for the copy from memory outside the JVM heap into the heap, and for the byte[] -> ByteBuffer -> CharBuffer -> char[] conversions inside the JVM.

All of which makes this approach not that efficient, despite the character buffering, which only limits the number of calls and copies.
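For completeness, a minimal sketch of this character-buffered, streaming style; the file name bigdata.txt and the buffer size are placeholders, and the charset is passed explicitly so decoding does not fall back to the platform default:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    public class LineReadDemo {
        public static void main(String[] args) throws IOException {
            // BufferedReader -> InputStreamReader (StreamDecoder) -> FileInputStream -> native read
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new FileInputStream("bigdata.txt"), StandardCharsets.UTF_8),
                    8192)) {                               // explicit char-buffer size
                String line;
                long count = 0;
                while ((line = reader.readLine()) != null) {
                    count++;                               // parse/collect the line here
                }
                System.out.println("lines: " + count);
            }
        }
    }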

Reading and writing files with FileChannel and byte buffers

Take reading as the example. Compared with the character-buffer + stream style, this one is byte-oriented, so the decoding step disappears, and it does not go a byte at a time: at the read()/C level it pulls data in bulk, which makes it both more flexible and more efficient.

The drawback is precisely that it is byte-oriented: if you are parsing a character file, you have to implement the decoding step yourself.
Also, because the file is manipulated through a FileChannel, it allocates direct memory (a DirectByteBuffer) from the native direct-memory pool, which is a kind of off-heap memory, so a copy of the data from off-heap memory into the heap is unavoidable.
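A minimal sketch of the byte-buffer + FileChannel style, again with a placeholder file name; note that the direct (off-heap) buffer still has to be copied into the heap before you can decode the bytes yourself:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class ChannelReadDemo {
        public static void main(String[] args) throws IOException {
            ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);   // off-heap direct buffer
            long total = 0;
            try (FileChannel ch = FileChannel.open(Paths.get("bigdata.txt"), StandardOpenOption.READ)) {
                while (ch.read(buf) != -1) {             // bulk read, no per-byte calls
                    buf.flip();
                    byte[] chunk = new byte[buf.remaining()];
                    buf.get(chunk);                      // copy off-heap -> heap
                    total += chunk.length;               // decode/parse chunk yourself here
                    buf.clear();
                }
            }
            System.out.println("bytes read: " + total);
        }
    }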

Reading and writing files with byte buffers + mmap

What is mmap?
Memory map, memory-mapped I/O.

Simply put, mmap sets up an access channel between the user process's virtual memory addresses and the file: you can think of part of the file as mapped into a region of memory your process can touch, so you manipulate the file as if it were memory. Neat.

Compared with the previous kinds of file I/O, this avoids explicit read() calls and the data copy between user and kernel buffers, so in general it is an efficient way to do I/O.

But "in general" implies exceptions: it works very well for fixed-length files and rather poorly for files that keep growing to an unknown length. Look at the method signature:

    java.nio.channels.FileChannel
    public abstract MappedByteBuffer map(FileChannel.MapMode mode,
                                         long position,
                                         long size)
                                  throws java.io.IOException

Note the position and size parameters, and note this check inside the implementation:

    if (size > Integer.MAX_VALUE)
        throw new IllegalArgumentException("Size exceeds Integer.MAX_VALUE");

So a single map() cannot cover more than 2 GB at a time. If a fixed-length file is larger than 2 GB, you need several map() calls.
If the file grows without a known bound, you cannot know the right size in advance, and the supposed efficiency is of little use.
At the same time, mmap eats memory. If you map part of a file and do a lot of random writes, you create a lot of dirty pages; the operating system processes those dirty pages periodically or on demand, writing the modified pages back to disk (write-back).

As you can imagine, because the dirty pages are scattered randomly, that persistence causes a lot of random disk I/O.

Therefore, the bigger the file and the bigger the mapped size, the higher the probability of random I/O; the blocking and lock-holding time during write-back grows, you get stuck, dirty pages pile up, the stalls get worse, the disk head thrashes, and mmap is suddenly not so efficient any more, sometimes slow enough to make the point painfully clear.
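To make the 2 GB limit concrete, here is a rough sketch, with a hypothetical file name, of mapping a fixed-length file in chunks; each map() call covers at most Integer.MAX_VALUE bytes, and this only works smoothly because the file length is known up front:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MmapChunks {
        public static void main(String[] args) throws IOException {
            final long CHUNK = Integer.MAX_VALUE;    // a single map() cannot exceed ~2 GB
            try (FileChannel ch = FileChannel.open(Paths.get("bigdata.bin"), StandardOpenOption.READ)) {
                long size = ch.size();
                for (long pos = 0; pos < size; pos += CHUNK) {
                    long len = Math.min(CHUNK, size - pos);
                    MappedByteBuffer mbb = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                    // ... read mbb like memory, e.g. mbb.get(), mbb.getInt(), etc.
                }
            }
        }
    }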

Memory

You should realize that memory is consumed not only by what your code news, but also by the call stacks of your quicksort and merge sort: the deeper the recursion, the deeper the stack and the larger the cost of pushing and popping frames. So when a sub-range becomes very small, cut off to insertion sort. Don't assume insertion sort is useless: on basically ordered data and on small amounts of data it is very efficient, and the cutoff noticeably reduces the depth and overhead of your recursive calls.

If you use merge sort, allocate the auxiliary array once, up front, rather than allocating and destroying it inside the recursive calls. Also notice that you can alternate the roles of the auxiliary array and the original array at each level to avoid copying data back and forth between them; with these two changes your merge sort becomes noticeably more efficient, as in the sketch below.
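Here is a small, self-contained sketch of those ideas together: the auxiliary array is allocated once, the two arrays swap roles at each level instead of copying back, and small ranges cut off to insertion sort (the cutoff value 16 is just a common choice, not something prescribed by the article):

    import java.util.Arrays;

    public class MergeSortDemo {
        private static final int CUTOFF = 16;    // small ranges go to insertion sort

        public static void sort(int[] a) {
            int[] aux = a.clone();               // auxiliary array allocated once, up front
            sort(aux, a, 0, a.length - 1);
        }

        // Sort the range [lo..hi] and leave the result in dst; src and dst swap roles
        // at each level, so no copy-back step is ever needed.
        private static void sort(int[] src, int[] dst, int lo, int hi) {
            if (hi - lo < CUTOFF) {
                insertionSort(dst, lo, hi);
                return;
            }
            int mid = lo + (hi - lo) / 2;
            sort(dst, src, lo, mid);             // children sort into src
            sort(dst, src, mid + 1, hi);
            merge(src, dst, lo, mid, hi);        // merge the two halves of src into dst
        }

        private static void merge(int[] src, int[] dst, int lo, int mid, int hi) {
            int i = lo, j = mid + 1;
            for (int k = lo; k <= hi; k++) {
                if (i > mid)              dst[k] = src[j++];
                else if (j > hi)          dst[k] = src[i++];
                else if (src[j] < src[i]) dst[k] = src[j++];
                else                      dst[k] = src[i++];
            }
        }

        private static void insertionSort(int[] a, int lo, int hi) {
            for (int i = lo + 1; i <= hi; i++) {
                int v = a[i], j = i - 1;
                while (j >= lo && a[j] > v) { a[j + 1] = a[j]; j--; }
                a[j + 1] = v;
            }
        }

        public static void main(String[] args) {
            int[] a = {5, 3, 9, 1, 4, 8, 2, 7, 6, 0};
            sort(a);
            System.out.println(Arrays.toString(a));
        }
    }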

At the same time, be aware that this memory may get swapped out to swap space, at which point unexpected disk I/O can drag your performance down badly.

As for memory fragmentation: reduce the frequent creation of small, short-lived objects and small arrays; pooling techniques can improve performance and reduce fragmentation.

Also, int and Integer occupy very different amounts of memory; of course, if you use generics you are forced into Integer (boxing) anyway, so there is little choice in the matter.

Disk

There is not much to discuss about disks; we are mostly talking about mechanical disks here. The important things to know are that sequential I/O is what a mechanical disk loves most, and that using buffering, caching (read-ahead) and batching when touching data reduces the share of time spent on disk I/O.

Thread / synchronous / asynchronous

In the previous article we handled the external sorting problem with a single-threaded, synchronous approach, without much thought; now let's briefly discuss whether there is a better way.

On synchronous/asynchronous versus blocking/non-blocking, just a quick note:

Synchronous/asynchronous is about the order in which multiple parties (objects or threads) execute; it is an agreement about logical ordering.

Blocking/non-blocking is about the execution state of a single object or thread: a thread that enters the blocked state has its allotted CPU time taken away.

Back to our external sorting problem. For a single large file, bigdata (4663 MB, 500 million lines), since it is a character file it is hard to have multiple threads read different parts of it at the same time, because you cannot know at which byte offset each line's newline sits. So reading is single-threaded, and it can at least take full advantage of sequential I/O (assuming the file's physical blocks are largely contiguous).

Because we read line by line, collect a batch into an in-memory buffer (memBuffer) and then sort it, reading will block: iowait is unavoidable, and while we wait no other thread is sorting, which wastes a great deal of CPU; and line-by-line reading rarely keeps the disk spinning at full tilt, so I/O capacity is wasted too.

Likewise, while sorting, being single-threaded, we cannot do any I/O. That is the dilemma.

If the file is not very large, this approach is simple, quick and saves brainpower; if the file is huge, it is probably too simple and crude.

To make full use of the CPU (on a multi-core machine) and keep the disk as busy as possible, we can drop the sorting from this first step and just split the large file into many unsorted small files as fast as possible.

Here, we have two options:

1. Single thread, asynchronous, non-blocking I/O
Because the I/O is non-blocking, the thread never gives up CPU time for I/O (only when the OS scheduler preempts it), and that time can be spent sorting. When I/O is ready, an asynchronous callback hands the data to the thread (into memBuffer); you hang this batch of fresh, hot data onto the to-be-sorted queue and carry on sorting. This approach reduces thread context switching to a minimum and squeezes the most out of one CPU, but it cannot exploit multiple cores. It is also clearly more complex to implement.
2. Multiple threads, synchronous, blocking I/O
Very simple: distribute the unsorted small files evenly among several threads (a sketch follows this list). How many threads? It depends on what you care about. If you care about throughput, you cannot avoid context switching among many threads, so open more of them and accept that performance will not be that high. If you care about performance, 2, 4, 8 or 16 threads is typical, and at most twice the number of physical CPU cores. Of course the right degree of concurrency has to come from testing; those numbers are only rules of thumb. This approach rests on the following premise:
let t1 = CPU time left idle by large-file I/O (cannot sort) + I/O time wasted while sorting memBuffer (cannot do I/O)
let t2 = multi-thread context-switch time + random-I/O time caused by several files doing I/O at the same time
If t2 < t1, this approach is worth choosing. In fact, measurements show that the larger the file, the more pronounced its advantage; try it yourself.
File Data features

Having said all that, the key is the data characteristics of the file itself.
If it is a character file, UTF-8 encoded, the integer 99999999 written as a string takes 8 bytes to represent;
whereas byte-oriented, big-endian, 4 bytes are enough:

    static private int makeInt(byte b3, byte b2, byte b1, byte b0) {
        return (((b3       ) << 24) |
                ((b2 & 0xff) << 16) |
                ((b1 & 0xff) <<  8) |
                ((b0 & 0xff)      ));
    }

The previous article dealt with 500 million integers; encoded as UTF-8 characters, with each line ending in \n, the total comes to about 4663 MB,
while encoding each number as a fixed 4 bytes brings the total down to about 1908 MB:
straight to 41% of the original size, my friend.
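A tiny sketch of the difference (the file names are placeholders): DataOutputStream.writeInt writes each value as 4 big-endian bytes, exactly the layout that makeInt above reassembles, while the text form of 99999999 plus a newline costs 9 bytes:

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class BinaryVsText {
        public static void main(String[] args) throws IOException {
            int value = 99999999;
            // Text form: "99999999\n" -> 9 bytes in UTF-8
            try (FileOutputStream text = new FileOutputStream("as-text.txt")) {
                text.write((value + "\n").getBytes(StandardCharsets.UTF_8));
            }
            // Binary form: 4 big-endian bytes per int, no matter how many digits
            try (DataOutputStream bin = new DataOutputStream(new FileOutputStream("as-binary.bin"))) {
                bin.writeInt(value);
            }
        }
    }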

The smaller the data, the less the I/O, and the running time comes down with it:

All runs on the same file, bigdata, 500 million numbers, read in as characters:

    • Characters in / characters out, 4663 MB / 4663 MB: split into several sorted small files, then multi-way merge sort. 772 s.
    • Characters in / bytes out, 4663 MB / 1908 MB: (1) split into several unsorted small files; (2) four threads with synchronous, blocking I/O sort them concurrently into sorted small files; (3) multi-way merge sort. 430 s.
    • Characters in / characters out, 4663 MB / 4663 MB: bitmap method. 191 s.

