Unicode can be seen as a mapping. It defines a numeric code (a code point) that is associated with each character.
The earliest Unicode was a 16-bit encoding. From 1996, Unicode 2.0 extended the code range to 0 through 10FFFF (hexadecimal). In binary, 10FFFF = 100001111111111111111, so a code can now take up to 21 bits.
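The range and the 21-bit figure can be checked with a short Python sketch. The names MAX_CODE_POINT and code_point_info are made up for this illustration; ord, chr, and int.bit_length are standard Python.

```python
MAX_CODE_POINT = 0x10FFFF  # upper limit of the Unicode code range


def code_point_info(ch: str) -> tuple[int, str, int]:
    """Return (code, U+XXXX form, number of significant bits) for one character."""
    cp = ord(ch)
    return cp, f"U+{cp:04X}", cp.bit_length()


print(MAX_CODE_POINT.bit_length())  # 21
print(code_point_info("A"))         # (65, 'U+0041', 7)
```

In Python, chr(0x10FFFF) is accepted, while chr(0x110000) raises ValueError because the value lies outside the range.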
UTF (Unicode Transformation Format) is a mapping algorithm that converts every Unicode code into a byte string. The mapping is reversible. UTF can be understood as a way of implementing Unicode, and it comes in several versions, chiefly UTF-8, UTF-16, and UTF-32.
UTF-8 and UTF-16 are both variable-length encodings; only UTF-32 is fixed-length. Variable-length encodings exist because a fixed-length scheme often wastes too much space.
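As a rough illustration, the sketch below counts how many bytes a single character takes in each encoding. The helper name encoded_lengths is hypothetical; the codec names are Python's built-in ones, and the "-le" variants are used only so that no BOM is counted.

```python
def encoded_lengths(ch: str) -> dict[str, int]:
    """Number of bytes one character occupies in each UTF (no BOM)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# An ASCII letter, a CJK character, and a character above U+FFFF.
for ch in ("A", "\u4e2d", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

The UTF-32 column stays at 4 for every character, while the other two columns vary.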
The UTF-8 encoding rules are very simple, with only two:
1) For a single-byte symbol, the first bit is set to 0 and the remaining seven bits hold the Unicode code of the symbol. For English letters, therefore, the UTF-8 encoding is identical to ASCII.
2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of every following byte are set to 10. All remaining bits hold the Unicode code of the symbol.
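The two rules can be sketched as a small Python function. This is an illustration rather than a reference implementation: the name utf8_encode is made up here, and the rejection of the surrogate range D800 to DFFF is an extra check that the two rules above do not mention.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code with the two rules above."""
    # Extra validity check (not part of the two rules): surrogates and
    # values above 10FFFF cannot be encoded.
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code: {cp:#x}")
    if cp < 0x80:
        # Rule 1: a single byte, leading bit 0, seven payload bits.
        return bytes([cp])
    # Rule 2: choose n from the number of payload bits required.
    if cp < 0x800:
        n = 2
    elif cp < 0x10000:
        n = 3
    else:
        n = 4
    # Following bytes, built last to first: 10 plus six payload bits each.
    tail = []
    for _ in range(n - 1):
        tail.append(0b10000000 | (cp & 0b111111))
        cp >>= 6
    # First byte: n one-bits, a zero bit, then the remaining payload bits.
    prefix = (0xFF << (8 - n)) & 0xFF
    return bytes([prefix | cp] + tail[::-1])


print(utf8_encode(0x41).hex())    # 41
print(utf8_encode(0x4E2D).hex())  # e4b8ad
```

The results can be compared against Python's own encoder, for example chr(0x4E2D).encode("utf-8").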
The following table shows how the bits are filled in for each range of codes (x marks a bit taken from the Unicode code). A worked example appears in the Example section later in this article.

Unicode range (hexadecimal)    UTF-8 bytes (binary)
0000 0000 - 0000 007F          0xxxxxxx
0000 0080 - 0000 07FF          110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF          1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF          11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

If a document consists mostly of English letters, UTF-8 takes far less space than UTF-16.
In Windows, Notepad and some other programs list an encoding named "Unicode" in the Save dialog box. This is very misleading, because it suggests that Unicode and UTF-8 are parallel alternatives. The name is a holdover from early versions of Windows NT; what it actually refers to is UTF-16.
Big-endian, little-endian, and the BOM (byte-order mark)
For the characters covered by UCS-2, UTF-16 stores the code directly in two bytes. Nothing specifies which of the two bytes comes first, so two conventions arose. For example, if a Unicode code is 4E25, storing 4E first and 25 second is the big-endian method; storing 25 first and 4E second is the little-endian method. Because UTF-8 is defined as a sequence of single bytes, it has no big-endian or little-endian distinction.
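The two storage orders can be reproduced with Python's int.to_bytes. The variable names below are chosen only for this sketch.

```python
code = 0x4E25  # the code used in the paragraph above

big = code.to_bytes(2, byteorder="big")        # high byte first
little = code.to_bytes(2, byteorder="little")  # low byte first

print(big.hex(" ").upper())     # 4E 25
print(little.hex(" ").upper())  # 25 4E

# Reading big-endian bytes as if they were little-endian gives another code.
misread = int.from_bytes(big, byteorder="little")
print(f"{misread:04X}")         # 254E
```

The last line shows why a reader has to know the byte order: the same two bytes stand for 4E25 under one convention and 254E under the other.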
These two odd names come from Jonathan Swift's Gulliver's Travels. In the book, a civil war breaks out in Lilliput, the country of small people, over whether an egg should be cracked open at the big end (Big-Endian) or at the little end (Little-Endian). Six rebellions were fought over the question; one emperor lost his life and another lost his throne.
A computer cannot tell on its own whether a file is stored in big-endian or little-endian order, so a mark must be added at the start of the file. In UTF-16, the code FEFF is used for this purpose: when it appears at the beginning of a byte stream, it identifies the byte order of that stream, that is, whether the high byte or the low byte comes first. In UTF-8, the same code is encoded as the bytes EF BB BF; since UTF-8 has no byte order to indicate, there it serves only as a signature marking the stream as UTF-8. When U+FEFF appears in the middle of a byte stream, it denotes a zero-width no-break space, which takes up no visible width. Starting with Unicode 3.2, U+FEFF should appear only at the beginning of a byte stream and be used only to identify the byte order, as its name, byte-order mark, suggests; the other usage is deprecated, and U+2060 is used instead to express a zero-width no-break.
The byte-order mark as represented in different encodings:

Encoding                  BOM bytes (hexadecimal)
UTF-8                     EF BB BF
UTF-16 (big-endian)       FE FF
UTF-16 (little-endian)    FF FE
UTF-32 (big-endian)       00 00 FE FF
UTF-32 (little-endian)    FF FE 00 00
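A reader of a byte stream can check for a byte-order mark before decoding. The sketch below is one possible way to do it, not a complete encoding detector; the names sniff_bom and _BOMS are hypothetical, while the BOM constants come from Python's standard codecs module.

```python
import codecs

# Longer marks are tested first, because the UTF-32 little-endian mark
# starts with the same two bytes as the UTF-16 little-endian mark.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, BOM length) if data starts with a known BOM, else (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


print(sniff_bom(b"\xfe\xff\x4e\x2d"))  # ('utf-16-be', 2)
print(sniff_bom(b"\xe4\xb8\xad"))      # (None, 0)
```

Note that the bytes FF FE 00 00 are ambiguous in principle: they could also be a UTF-16 little-endian BOM followed by the code 0000. This sketch assumes the UTF-32 reading.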
Example
Take the character "中", whose Unicode code is 4E2D = 100111000101101 in binary.
With UTF-16, two byte sequences are possible, depending on the byte order:
FE FF 4E 2D    UTF-16, big-endian
FF FE 2D 4E    UTF-16, little-endian
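Both byte sequences can be reproduced in Python. This is only a sketch; the variable names are made up, and the BOM is prepended by hand so that each step stays visible.

```python
import codecs

ch = "\u4e2d"  # the character whose code is 4E2D

with_bom_be = codecs.BOM_UTF16_BE + ch.encode("utf-16-be")
with_bom_le = codecs.BOM_UTF16_LE + ch.encode("utf-16-le")

print(with_bom_be.hex(" ").upper())  # FE FF 4E 2D
print(with_bom_le.hex(" ").upper())  # FF FE 2D 4E

# The generic "utf-16" decoder reads the BOM and picks the byte order itself.
print(with_bom_be.decode("utf-16") == with_bom_le.decode("utf-16"))  # True
```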
With UTF-8, the code 4E2D = 100111000101101 is 15 bits long. That is too many for the two-byte form, which carries only 11 bits, so the three-byte form with its 16 x positions is used:
1110xxxx 10xxxxxx 10xxxxxx
Fill the x positions with 100111000101101 from right to left, and pad any positions left over at the high end with 0:
11100100 10111000 10101101 = E4 B8 AD
Fifteen of the x positions carry the bits of 100111000101101; the one remaining position, the first x of the first byte, is the padding 0.
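The same fill-in can be done on bit strings in Python, mirroring the steps above literally. This is a sketch for this one character; the variable names are chosen for the illustration.

```python
cp = 0x4E2D
template = "1110xxxx 10xxxxxx 10xxxxxx"

# Pad the code on the left with zeros to match the number of x positions.
payload = format(cp, "b").zfill(template.count("x"))

# Replace each x, left to right, with the next payload bit.
bits = iter(payload)
filled = "".join(next(bits) if c == "x" else c for c in template)

encoded = bytes(int(group, 2) for group in filled.split())
print(payload)                # 0100111000101101
print(filled)                 # 11100100 10111000 10101101
print(encoded.hex().upper())  # E4B8AD
```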
Stored as UTF-8 without a BOM, the file contains just E4 B8 AD.
Stored as UTF-8 with a BOM, the bytes EF BB BF are added in front, giving EF BB BF E4 B8 AD.
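Python exposes both variants through its "utf-8" and "utf-8-sig" codecs, which makes the difference easy to see. The variable names are made up for this sketch.

```python
ch = "\u4e2d"

plain = ch.encode("utf-8")       # no BOM
signed = ch.encode("utf-8-sig")  # EF BB BF first, then the character

print(plain.hex(" ").upper())    # E4 B8 AD
print(signed.hex(" ").upper())   # EF BB BF E4 B8 AD

# Decoding with plain "utf-8" keeps the BOM as a U+FEFF character;
# decoding with "utf-8-sig" strips it.
print(len(signed.decode("utf-8")), len(signed.decode("utf-8-sig")))  # 2 1
```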
The example shows that, leaving aside the BOM and the byte order, UTF-16 does not necessarily take more space than UTF-8: here UTF-16 needs two bytes and UTF-8 needs three. Both are variable-length encodings. UTF-8 uses 1 to 4 bytes per character, and UTF-16 uses 2 or 4 bytes.
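The trade-off can be seen on whole strings as well. In the sketch below, the helper name sizes and the two sample strings are made up for illustration, and sizes are measured without a BOM.

```python
def sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 size, UTF-16 size) in bytes, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


english = "Unicode"           # seven ASCII letters
chinese = "\u4e2d\u6587"      # two CJK characters, both below U+FFFF

print(sizes(english))         # (7, 14)
print(sizes(chinese))         # (6, 4)
print(sizes("\U0001F600"))    # (4, 4)
```

Which encoding is smaller depends on the text: ASCII-heavy text favors UTF-8, while text made up of characters between U+0800 and U+FFFF favors UTF-16.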