Problem
iOS has emoji characters, and many machines cannot display them properly. I decided to study this problem, using the following references:
1. A tool for converting strings into Unicode and UTF-8.
2. The Wikipedia article on UTF-16.
3. The author's blog article introducing UTF-8.
For example, enter an emoji smile in the input box and check its encoding:
We can see that the Unicode encoding is D83D-DE03 and the UTF-8 encoding is F09F-9883. This is unusual and is explained below!
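For readers without such a tool, the same two values can be reproduced with Python's built-in codecs. This is only a minimal sketch: the variable names are illustrative, and it assumes the tool displays UTF-16 code units in big-endian order without a byte order mark.

```python
# The emoji smile from the example above.
smile = "😃"

# Big-endian UTF-16 without a byte order mark: two 16-bit units.
utf16_hex = smile.encode("utf-16-be").hex().upper()

# UTF-8: four bytes.
utf8_hex = smile.encode("utf-8").hex().upper()

print(utf16_hex)  # D83DDE03
print(utf8_hex)   # F09F9883
```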
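Note: UTF-8 usually takes 1 to 3 bytes per character, which covers plane 0 of the Unicode code space. It is worth stating the UTF-8 encoding rules:
Unicode range (hex)      UTF-8 byte sequence (binary)
0000 0000 - 0000 007F    0xxxxxxx
0000 0080 - 0000 07FF    110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx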
Using the table above, take the Chinese character "Han" (汉) as an example. Its Unicode value is 0x6C49, which falls in the three-byte range, and its UTF-8 value is 0xE6B189. The formula is correct.
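The "Han" example can be checked by applying the UTF-8 bit layout by hand. The function below is a sketch written for this article (the name `utf8_encode` is hypothetical); for brevity it does not reject the surrogate range, which a strict encoder would.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point using the UTF-8 bit layout."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(0x6C49).hex().upper())  # E6B189
```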
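Now look at the Unicode value reported for the smile symbol: D83D-DE03. Read as a single number, 0xD83DDE03 exceeds the Unicode maximum of 0x10FFFF, so how can it represent a character? Let's work backwards from the UTF-8 bytes F09F-9883 to recover the corresponding Unicode value:
F0 9F 98 83 = 11110000 10011111 10011000 10000011
Payload bits after stripping the 11110 and 10 prefixes: 000 011111 011000 000011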
The result is 0x1F603. This is very different from the reported Unicode value D83D-DE03, so some conversion step must lie in between. That conversion is the UTF-16 surrogate mechanism!
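The reverse calculation can also be written out in code. This is a sketch, not a full decoder: `utf8_decode_one` is a hypothetical helper that reads a single character and does not reject overlong encodings.

```python
def utf8_decode_one(data):
    """Decode the first UTF-8 character in data and return its code point."""
    first = data[0]
    if first < 0x80:                 # 0xxxxxxx
        length, value = 1, first
    elif first >> 5 == 0b110:        # 110xxxxx
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:       # 1110xxxx
        length, value = 3, first & 0x0F
    elif first >> 3 == 0b11110:      # 11110xxx
        length, value = 4, first & 0x07
    else:
        raise ValueError("invalid leading byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for byte in data[1:length]:
        if byte >> 6 != 0b10:        # continuation bytes are 10xxxxxx
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


print(hex(utf8_decode_one(bytes.fromhex("F09F9883"))))  # 0x1f603
```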
UTF-16
UTF is the abbreviation of "Unicode/UCS Transformation Format", meaning a way of converting Unicode characters into a particular format. The table above shows the correspondence between Unicode and UTF-8. It is a simple mapping, but a very useful one.
Under normal circumstances a Unicode character in plane 0 fits in two bytes, and converting it to UTF-8 simply means mapping those two bytes to a UTF-8 sequence according to the rules.
In fact, plane 0 contains a special surrogate area used to point to characters in planes 1 to 16. This area is 0xD800 to 0xDFFF: 0xD800 to 0xDBFF are the lead surrogates, and 0xDC00 to 0xDFFF are the trail surrogates. A surrogate pair (lead, trail) represents one character in UTF-16. For the emoji smile, the lead surrogate is D83D and the trail surrogate is DE03. Subtracting the bases gives 0xD83D - 0xD800 = 0x3D and 0xDE03 - 0xDC00 = 0x203, and 0x10000 + (0x3D << 10) + 0x203 = 0x1F603, which matches the value recovered from UTF-8.
That explains the discrepancy.
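The surrogate arithmetic can be sketched in both directions. The function names below are hypothetical and chosen only for this illustration.

```python
def decode_surrogate_pair(lead, trail):
    """Combine a UTF-16 lead and trail surrogate into one code point."""
    if not 0xD800 <= lead <= 0xDBFF:
        raise ValueError("not a lead surrogate")
    if not 0xDC00 <= trail <= 0xDFFF:
        raise ValueError("not a trail surrogate")
    # Each surrogate carries 10 bits of the offset above 0x10000.
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


def encode_surrogate_pair(code_point):
    """Split a code point from planes 1 to 16 into (lead, trail)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in planes 1 to 16")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(hex(decode_surrogate_pair(0xD83D, 0xDE03)))  # 0x1f603
lead, trail = encode_surrogate_pair(0x1F603)
print(hex(lead), hex(trail))  # 0xd83d 0xde03
```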
As a programmer, the author offers a metaphor: a surrogate pair (lead, trail) is like a pointer to a code point in planes 1 to 16. A quick calculation shows that 16 planes × 65,536 code points each = 1,048,576 code points, and 1,024 lead surrogates × 1,024 trail surrogates = 1,048,576 pairs as well. This is a perfect fit!
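The counting argument is easy to verify. A short check, with constant names made up for this sketch:

```python
SUPPLEMENTARY_PLANES = 16
CODE_POINTS_PER_PLANE = 0x10000        # 65,536

lead_count = 0xDBFF - 0xD800 + 1       # 1,024 lead surrogates
trail_count = 0xDFFF - 0xDC00 + 1      # 1,024 trail surrogates

code_points = SUPPLEMENTARY_PLANES * CODE_POINTS_PER_PLANE
pairs = lead_count * trail_count

print(code_points)  # 1048576
print(pairs)        # 1048576
```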
The advantage of this design is that we can decide based on the first (high) byte of each UTF-16 unit:
if (first_byte >= 0xD8 && first_byte <= 0xDB) {
    // Surrogate region: a character in planes 1 to 16. Four bytes form one unit.
} else {
    // Normal mapping region: plane 0. Two bytes form one unit.
}
The result: by following this rule, the computer can tell whether two bytes or four bytes represent one character.
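As a rough illustration of that rule, the sketch below splits a big-endian UTF-16 byte string into characters by inspecting the high byte of each unit. The function name `split_utf16be` is hypothetical, big-endian input is an assumption, and the sketch does not validate truncated input or unpaired trail surrogates.

```python
def split_utf16be(data):
    """Split big-endian UTF-16 bytes into one bytes object per character."""
    characters = []
    i = 0
    while i < len(data):
        # High byte 0xD8..0xDB marks a lead surrogate: the character is 4 bytes.
        width = 4 if 0xD8 <= data[i] <= 0xDB else 2
        characters.append(data[i:i + width])
        i += width
    return characters


# "Han" (0x6C49) followed by the emoji smile (D83D DE03).
parts = split_utf16be(bytes.fromhex("6C49D83DDE03"))
print([p.hex().upper() for p in parts])  # ['6C49', 'D83DDE03']
```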
Summary
UTF-8 and UTF-16 are essentially the same here: both encode the same Unicode values. UTF-8 is a direct mapping. UTF-16 must additionally map characters outside plane 0 through the surrogate area (lead surrogate, trail surrogate). UTF-16 takes one step more than UTF-8!
Then again, without the surrogate area, given the emoji smile's D83D-DE03, the computer could not even tell whether it is one character or two.
UTF-16 code from iPhone emoji