A summary of Unicode, UTF-8, and UTF-16 in a few simple words

Source: Internet
Author: User

Concept

Let's start with the basic concepts: what Unicode is, what UTF-8 is, and what UTF-16 is.

For a complete description of Unicode, UTF-8, and UTF-16, see the Wiki entries (Unicode, UTF-8, UTF-16). In simple terms, Unicode defines the set of numeric values (called code points) that represent characters. The UTF standards, such as UTF-8 and UTF-16, define how those code points are encoded as sequences of bytes.
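To make the split between a code point and its encoding concrete, here is a minimal sketch that turns one code point into UTF-8 bytes. The function name encode_utf8 is made up for this illustration, and the sketch assumes the input is a valid code point (no validation is done).

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
// Sketch only: assumes cp <= 0x10FFFF and cp is not a surrogate.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, the code point U+0041 ("A") becomes the single byte 0x41, while U+4E2D (a common Chinese character) becomes the three bytes E4 B8 AD.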

UTF-8 advantages

The biggest advantage of UTF-8 is that it has no concept of byte order. It is therefore particularly well suited to transmitting strings over a network, because endianness (big-endian versus little-endian) does not need to be considered.

Disadvantage

If UTF-8 is used for local string processing, English characters cause little trouble: one char variable holds one English character. Chinese and other East Asian character sets are harder. Given char str[], str[0] cannot hold a complete Chinese character, because in UTF-8 a Chinese character takes at least three chars. This makes string operations by subscript very painful.

In addition, because a Chinese character takes at least three bytes, UTF-8 is also at a disadvantage in network transmission for such text: it uses more traffic.
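The gap between bytes and characters can be seen with a small sketch that counts characters in a UTF-8 string by skipping continuation bytes. The name utf8_length is hypothetical, and the sketch assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string.
// Every byte that is NOT a continuation byte (10xxxxxx) starts a new character.
// Sketch only: assumes the input is well-formed UTF-8.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

A string holding two Chinese characters has size() == 6 but utf8_length() == 2, so s[0] is only the first third of the first character.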

UTF-16 advantages

UTF-16 LE is the default Unicode encoding on Windows, where it is represented by wchar_t. VC automatically encodes every wchar_t * string as UTF-16, including string literals hard-coded in .h/.cpp files.

String literals have many pitfalls. In particular, for a char * literal, the encoding that ends up in memory depends entirely on the encoding of the source file. For example, if the file is GBK encoded and contains a Chinese string literal such as char *str = "中午" (Chinese for "noon"), the bytes that str points to are GBK encoded; if the file is UTF-8 encoded, the bytes in memory are UTF-8 encoded. This is why we always emphasize that strings should be placed in resource files rather than hard-coded in .h/.cpp files!
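One way to take the source file encoding out of the picture, assuming a compiler that supports the C++11 Unicode literal prefixes, is to state the encoding in the literal itself and write the character as a universal character name. This is only a sketch of that idea, not something the approach above relies on; the constant names are made up.

```cpp
#include <cstddef>

// U+4E2D is written as \u4e2d, so this source file stays pure ASCII
// and its on-disk encoding no longer matters.
// u8"..." requests UTF-8 and u"..." requests UTF-16 (both C++11 prefixes).

// Bytes used by the UTF-8 literal, not counting the terminating NUL.
constexpr std::size_t utf8_bytes = sizeof(u8"\u4e2d") - 1;

// 16-bit units used by the UTF-16 literal, not counting the terminating NUL.
constexpr std::size_t utf16_units = sizeof(u"\u4e2d") / sizeof(char16_t) - 1;
```

On a conforming compiler, utf8_bytes is 3 and utf16_units is 1 regardless of how the file is saved.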

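Another advantage of UTF-16 is that common characters fit in two bytes, that is, in a single wchar_t (on the Windows platform). On Windows, wchar_t is therefore well suited to string storage: for common characters, one wchar_t represents one character, which is easy to work with. Less common characters outside the Basic Multilingual Plane are the exception and take two wchar_t (a surrogate pair).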
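A short sketch shows when one 16-bit unit is enough and when it is not. The function name encode_utf16 is hypothetical, char16_t is used instead of wchar_t so that the unit size is the same on every platform, and the input is assumed to be a valid code point.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Sketch only: assumes cp <= 0x10FFFF and cp is not itself a surrogate.
std::u16string encode_utf16(std::uint32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit holds the code point directly.
        out += static_cast<char16_t>(cp);
    } else {
        // Supplementary planes: split the 20 remaining bits into a surrogate pair.
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    }
    return out;
}
```

U+4E2D encodes to the single unit 0x4E2D, while U+1F600 (an emoji) encodes to the two units 0xD83D 0xDE00.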

Disadvantage

No single character type represents UTF-16 across platforms. C++98/03 defines wchar_t very loosely, so wchar_t is 2 bytes on Windows and 4 bytes on Unix-like systems. This may make code harder to port (I have not ported any such code myself, so I am not sure where the difficulties lie or how serious they are).

Although C++11 defines char16_t to represent UTF-16, MS VS2013 does not support it, so char16_t is not portable at present.

As far as I know, UTF-16 has a sorting disadvantage compared with GBK. If you want to sort Chinese characters alphabetically by pinyin, sorting GBK-encoded strings gives the correct result, while sorting UTF-16-encoded strings does not. (I have not needed this so far, so I have not verified it, but it looks like I soon will, and I will verify it then.)

When UTF-16 strings are transmitted over a network, byte order (endianness) has to be taken into account.
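Code that has to run on both kinds of platform can at least detect which case it is in at compile time. The following is a minimal sketch; the constant names are invented for this example.

```cpp
#include <cstddef>

// Number of bytes in one wchar_t on the platform this is compiled for
// (typically 2 on Windows and 4 on Unix-like systems).
constexpr std::size_t wchar_bytes = sizeof(wchar_t);

// True when wchar_t is too narrow to hold every code point in a single unit,
// so wide strings may contain surrogate pairs.
constexpr bool wchar_needs_surrogates = wchar_bytes < 4;
```

Portable code can branch on wchar_needs_surrogates (or use a static_assert) instead of silently assuming one size.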
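The byte order issue can be sketched as follows: the same UTF-16 string produces two different byte streams depending on which order the sender picks, so sender and receiver have to agree. The function name utf16_to_bytes and its big_endian parameter are hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Serialize UTF-16 code units into bytes in an explicitly chosen byte order.
std::vector<std::uint8_t> utf16_to_bytes(const std::u16string& s, bool big_endian) {
    std::vector<std::uint8_t> out;
    out.reserve(s.size() * 2);
    for (char16_t unit : s) {
        const std::uint8_t hi = static_cast<std::uint8_t>(unit >> 8);
        const std::uint8_t lo = static_cast<std::uint8_t>(unit & 0xFF);
        if (big_endian) {
            out.push_back(hi);   // most significant byte first
            out.push_back(lo);
        } else {
            out.push_back(lo);   // least significant byte first
            out.push_back(hi);
        }
    }
    return out;
}
```

The unit 0x4E2D is sent as 4E 2D in big-endian order and as 2D 4E in little-endian order. A UTF-8 stream has no such choice, because it is defined byte by byte.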

UTF-32 advantages

The advantage is obvious: every character takes exactly 4 bytes, so the encoding is fixed-length. One wchar_t (on Unix-like systems) represents one character.
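As a sketch of what fixed length buys, the following decodes UTF-8 into a std::u32string, after which plain subscripting returns whole characters. The name utf8_to_utf32 is made up, and the sketch assumes well-formed UTF-8 input (malformed input is not detected).

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into UTF-32, one char32_t per code point.
// Sketch only: assumes well-formed input and performs no validation.
std::u32string utf8_to_utf32(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        // The lead byte tells how many bytes the character occupies.
        const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out += cp;
        i += len;
    }
    return out;
}
```

After decoding, out[1] is the second character of the text, whatever script it belongs to.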

Disadvantage

For English strings, space consumption is high: four bytes per character instead of one in UTF-8.

It faces the same problems as UTF-16 above: consistency across platforms, sorting, and network transmission. char32_t is not yet supported in VS2013 (even the VS 14 CTP does not plan to support it).

Summary

UTF-8 is the best encoding for transmitting strings over a network. UTF-16 is the best encoding for local strings. If the network transmission protocol is defined, UTF-16 is also a good choice for transmitting strings over a network, especially for Chinese and other East Asian character sets, because it saves traffic compared with UTF-8. I have no particular preference or need for UTF-32, so I am not using it for now.
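The traffic trade-off can be checked with a small sketch that adds up how many bytes a sequence of code points needs in each encoding. EncodedSizes and encoded_sizes are names invented for this example, and the input is assumed to contain only valid code points.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Byte counts for the same text in the three encodings.
struct EncodedSizes {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// Sum the encoded size of each code point.
// Sketch only: assumes every value is a valid code point.
EncodedSizes encoded_sizes(const std::vector<std::uint32_t>& code_points) {
    EncodedSizes n{0, 0, 0};
    for (std::uint32_t cp : code_points) {
        n.utf8  += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        n.utf16 += cp < 0x10000 ? 2 : 4;
        n.utf32 += 4;
    }
    return n;
}
```

Two common Chinese characters need 6 bytes in UTF-8, 4 in UTF-16, and 8 in UTF-32, while two English letters need 2, 4, and 8 bytes respectively.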

 
