Concept
Let's start with the basic concepts: what Unicode is, what UTF-8 is, and what UTF-16 is.
For a complete description of Unicode, UTF-8, and UTF-16, see Wikipedia (Unicode, UTF-8, UTF-16). In simple terms, Unicode defines the set of numeric values (called code points) that represent characters. The UTF standards, such as UTF-8 and UTF-16, define how those code points are mapped to sequences of bytes.
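To make the code point to bytes mapping concrete, here is a minimal sketch of a UTF-8 encoder. The function name encode_utf8 is made up for this illustration, and the code assumes it is handed a valid code point (no surrogates, nothing above U+10FFFF).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Assumes cp is a valid scalar value (not a surrogate, not above U+10FFFF).
std::string encode_utf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, the code point U+4E2D (the Chinese character 中) comes out as the three bytes E4 B8 AD, while U+0041 (A) stays a single byte.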
UTF-8 advantages
The biggest advantage of UTF-8 is that it has no byte-order issue, because its code unit is a single byte. It is therefore particularly well suited to transmitting strings over a network, with no need to consider endianness.
UTF-8 disadvantages
When UTF-8 is used for local string processing, English text poses little trouble: one char variable holds one English character. Chinese and other East Asian character sets are harder. Given char str[], str[0] cannot hold a whole Chinese character, because in UTF-8 a Chinese character takes at least three char values. That makes operating on strings by subscript very painful.
In addition, because a Chinese character takes at least three bytes, UTF-8 is also at a disadvantage when transmitting such text over a network: it uses more traffic.
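The subscript problem can be sketched as follows. Both function names are invented for this example, and the code assumes well-formed UTF-8 input. Finding where the n-th character starts means walking the string from the beginning, instead of a plain str[n].

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with this lead byte.
// Returns 0 for a continuation byte or an invalid lead byte.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

// Byte offset at which the n-th character (0-based) starts.
// Assumes well-formed UTF-8 and n not beyond the end; costs O(n), unlike s[n].
std::size_t utf8_offset_of(const std::string& s, std::size_t n) {
    std::size_t offset = 0;
    while (n > 0) {
        assert(offset < s.size());
        std::size_t len = utf8_sequence_length(static_cast<unsigned char>(s[offset]));
        assert(len != 0);  // a continuation byte here would mean malformed input
        offset += len;
        --n;
    }
    return offset;
}
```

With the seven-byte string "A中文" (41 E4 B8 AD E6 96 87), character 1 starts at byte 1 and character 2 starts at byte 4, so s[1] on its own is only the first third of 中.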
UTF-16 advantages
UTF-16 LE is the default Unicode encoding on Windows, represented by wchar_t. Visual C++ automatically encodes every wchar_t * string as UTF-16, including string literals hard-coded in .h/.cpp files. String literals have many pitfalls, though. For a char * literal in particular, the encoding that ends up in memory depends entirely on the encoding of the source file. If the file is GBK encoded, then for a Chinese literal in that file, such as char *str = "中午" ("noon"), the bytes that str points to are GBK encoded. If the file is UTF-8 encoded, the bytes in memory are UTF-8. This is why we keep stressing that strings belong in resource files rather than hard-coded in .h/.cpp files!
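One way to see which encoding a char * literal actually ended up with is to print its bytes. The helper below, hex_dump, is a name made up for this sketch and is not part of any library.

```cpp
#include <string>

// Hypothetical helper: render the raw bytes of a narrow string as
// space-separated uppercase hex, so the in-memory encoding can be inspected.
std::string hex_dump(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {
        if (!out.empty()) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

For the character 中 (U+4E2D), a literal stored as UTF-8 dumps as E4 B8 AD, while the same literal stored as GBK dumps as D6 D0. Passing a literal from your own source file to hex_dump shows which of the two your compiler and file encoding produced.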
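Another advantage of UTF-16 is that commonly used characters (those in the Basic Multilingual Plane) fit in two bytes, that is, one wchar_t on Windows. wchar_t is therefore especially well suited to string storage on Windows: for these characters one wchar_t represents one character, which is easy to work with. Less common characters outside that range take two wchar_t values (a surrogate pair).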
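The one-unit-per-character convenience holds only for common characters, and a short sketch shows where it stops. The names below are hypothetical, and std::uint16_t stands in for a UTF-16 code unit so that the example does not depend on the width of wchar_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// High (lead) and low (trail) surrogate ranges defined by UTF-16.
bool is_high_surrogate(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate(std::uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a surrogate pair into the code point it represents.
std::uint32_t decode_surrogate_pair(std::uint16_t high, std::uint16_t low) {
    assert(is_high_surrogate(high) && is_low_surrogate(low));
    return 0x10000 + ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
                   + (static_cast<std::uint32_t>(low) - 0xDC00);
}

// Count characters (code points) in a UTF-16 sequence.
// A valid surrogate pair counts once; an unpaired surrogate counts as one.
std::size_t utf16_code_point_count(const std::vector<std::uint16_t>& units) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < units.size(); ++i) {
        if (is_high_surrogate(units[i]) && i + 1 < units.size() &&
            is_low_surrogate(units[i + 1])) {
            ++i;  // skip the low half of the pair
        }
        ++count;
    }
    return count;
}
```

The two units 4E2D 6587 (中文) are two characters, but the two units D83D DE00 are a single character, U+1F600.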
UTF-16 disadvantages
No single character type portably represents UTF-16. C++98/03 define wchar_t very loosely, so wchar_t is 2 bytes on Windows and 4 bytes on Unix-like systems. This can make porting code difficult (I have not ported any myself, so I am not sure what the difficulties are or how severe they would be).
C++11 defines char16_t to represent UTF-16, but Microsoft Visual Studio 2013 does not support it, so char16_t is not yet portable.
As far as I know, UTF-16 also has a sorting disadvantage compared with GBK. If you want to sort Chinese characters alphabetically by pinyin, sorting by GBK encoded values gives the correct result, while sorting by UTF-16 values does not. (I have not needed this so far, so I have not verified it, but it looks like I will need it soon and I will verify it then.)
Transmitting UTF-16 strings over a network requires taking endianness into account.
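One possible way around the varying width of wchar_t, offered only as a sketch, is to convert wide strings into fixed-width 16-bit units at the boundary of portable code. The alias utf16_unit and the function to_utf16 are invented names, and the code assumes that wchar_t holds UTF-16 when it is 2 bytes wide and UTF-32 when it is 4 bytes wide.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical fixed-width alias for one UTF-16 code unit. It stands in for
// char16_t on compilers that lack it.
typedef std::uint16_t utf16_unit;

// Convert a wide string to UTF-16 code units regardless of wchar_t's width.
// Assumption: 2-byte wchar_t already holds UTF-16; 4-byte wchar_t holds UTF-32.
std::vector<utf16_unit> to_utf16(const std::wstring& ws) {
    std::vector<utf16_unit> out;
    for (wchar_t wc : ws) {
        std::uint32_t v = static_cast<std::uint32_t>(wc);
        if (v < 0x10000) {
            // Always taken when wchar_t is 2 bytes: units are copied unchanged.
            out.push_back(static_cast<utf16_unit>(v));
        } else {
            // Only reachable with a 4-byte wchar_t: split into a surrogate pair.
            assert(v <= 0x10FFFF);
            v -= 0x10000;
            out.push_back(static_cast<utf16_unit>(0xD800 | (v >> 10)));
            out.push_back(static_cast<utf16_unit>(0xDC00 | (v & 0x3FF)));
        }
    }
    return out;
}
```

On Windows this is a plain copy. On a Unix-like system, a wide string holding U+1F600 in one wchar_t becomes the two units D83D DE00.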
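A rough sketch of what handling endianness involves: both ends agree on one byte order for the wire and convert explicitly, whatever the local CPU uses. The function names are made up, and big-endian is chosen here only as an assumed convention for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Serialize UTF-16 code units as big-endian bytes (the wire order assumed here).
std::string utf16_to_wire(const std::vector<std::uint16_t>& units) {
    std::string bytes;
    bytes.reserve(units.size() * 2);
    for (std::uint16_t u : units) {
        bytes += static_cast<char>(u >> 8);    // high byte first
        bytes += static_cast<char>(u & 0xFF);  // then low byte
    }
    return bytes;
}

// Rebuild the code units from big-endian bytes received off the wire.
std::vector<std::uint16_t> utf16_from_wire(const std::string& bytes) {
    assert(bytes.size() % 2 == 0);  // every code unit is exactly two bytes
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        std::uint16_t hi = static_cast<unsigned char>(bytes[i]);
        std::uint16_t lo = static_cast<unsigned char>(bytes[i + 1]);
        units.push_back(static_cast<std::uint16_t>((hi << 8) | lo));
    }
    return units;
}
```

Because the conversion uses shifts rather than copying memory, it gives the same bytes on little-endian and big-endian machines. The unit 4E2D always goes out as 4E then 2D.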
UTF-32 advantages
The advantage is obvious: every character takes 4 bytes, so the encoding is fixed-length. On Unix-like systems, one wchar_t represents one character.
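The fixed length is what makes subscripting work again. As a sketch, the hypothetical function below decodes UTF-8 into a vector of 32-bit code points, after which element n really is character n. It assumes well-formed input and does no error recovery.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode well-formed UTF-8 into UTF-32 code points, one element per character.
// Assumes valid input; malformed input trips an assertion instead of recovering.
std::vector<std::uint32_t> utf8_to_utf32(const std::string& s) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else {
            assert((lead & 0xF8) == 0xF0);
            len = 4;
            cp = lead & 0x07;
        }
        assert(i + len <= s.size());
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

After decoding "A中", element 0 is 0x41 and element 1 is 0x4E2D, with no walking from the start of the string.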
UTF-32 disadvantages
For English strings, the space overhead is high.
It faces the same problems as UTF-16 above: no consistent character type, sorting, and network transmission. VS2013 does not support char32_t yet (and even the VS 14 CTP does not plan to support it).
Summary
UTF-8 is the best encoding for transmitting strings over a network. UTF-16 is the best encoding for local strings. If the network protocol is well defined, UTF-16 also works very well as the wire encoding for strings, especially for Chinese and other East Asian character sets, where it saves traffic compared with UTF-8. I have no particular preference or need for UTF-32, so I am not using it for now.
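The traffic comparison can be checked with a little arithmetic. The sketch below uses a struct and function name invented for this example; it adds up how many bytes a sequence of code points needs in each encoding, ignoring any byte order mark or terminator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Byte counts for the same text in each encoding (hypothetical helper type).
struct EncodedSizes {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// Sum the bytes each encoding needs for the given code points.
EncodedSizes encoded_sizes(const std::vector<std::uint32_t>& code_points) {
    EncodedSizes sizes = {0, 0, 0};
    for (std::uint32_t cp : code_points) {
        assert(cp <= 0x10FFFF);
        sizes.utf8  += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        sizes.utf16 += cp < 0x10000 ? 2 : 4;  // surrogate pair above the BMP
        sizes.utf32 += 4;                     // always fixed width
    }
    return sizes;
}
```

For the English text "Hi" this gives 2, 4, and 8 bytes for UTF-8, UTF-16, and UTF-32. For the Chinese text 中文 it gives 6, 4, and 8 bytes, which is the saving over UTF-8 mentioned above.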