A summary of Unicode, UTF-8, and UTF-16 in a few simple words

Last Update:2014-08-12 Source: Internet

Author: User

Concept

Let's start with the basic concepts: what Unicode is, what UTF-8 is, and what UTF-16 is.

For a complete description of Unicode, UTF-8, and UTF-16, see the wiki entries (Unicode, UTF-8, UTF-16). In simple terms, Unicode defines the set of numeric values (called code points) that can be used to represent characters. Encoding forms such as UTF-8 and UTF-16 define how each of these code points is mapped to a sequence of bytes.
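
To make the distinction between a code point and its encoded bytes concrete, here is a minimal C++ sketch of the UTF-8 mapping. The function name encode_utf8 is made up for this illustration and is not a standard library call.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Returns an empty string for values that are not valid scalar values
// (surrogates U+D800..U+DFFF, or anything above U+10FFFF).
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encode_utf8(0x41) gives the single byte 0x41 ("A"), while encode_utf8(0x4E2D), the Chinese character 中, gives the three bytes E4 B8 AD.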

UTF-8 advantages

The biggest advantage of UTF-8 is that it has no concept of byte order, because its code unit is a single byte. It is therefore particularly well suited to transmitting strings over a network, with no need to consider the endianness of either side.

UTF-8 disadvantages

If UTF-8 is used for local string processing, English text causes little trouble: one char variable holds one English character. Chinese and other East Asian character sets are harder. Given char str[], str[0] cannot hold a complete Chinese character, because in UTF-8 a Chinese character needs at least three char values. This makes string operations by subscript very painful.
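
The gap between bytes and characters can be seen with a small sketch that counts code points in a UTF-8 string. The helper name count_code_points is hypothetical, and the sketch assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count the code points in a UTF-8 string.
// Every code point starts with exactly one byte that is NOT a
// continuation byte (10xxxxxx), so counting those bytes is enough.
// Assumes the input is well-formed UTF-8; no validation is done.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For the two-character string "\xE4\xB8\xAD\xE6\x96\x87" (中文), size() reports 6 bytes while count_code_points reports 2, which is why indexing with str[i] does not land on character boundaries.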

In addition, because a Chinese character takes at least three bytes, UTF-8 is also at a disadvantage for network transmission of Chinese text: it uses more traffic than UTF-16, which needs only two bytes for a common Chinese character.

UTF-16 advantages

UTF-16LE is the default Unicode encoding on Windows, where it is represented by wchar_t. For every wchar_t * string, including string literals hard-coded in .h/.cpp files, VC automatically uses UTF-16.

String literals hide many pitfalls. In particular, for a char * literal, the encoding that ends up in memory depends on the encoding of the source file. If the file is GBK encoded and contains a Chinese literal such as char *str = "中午" ("Noon"), the bytes that str points to are GBK encoded; if the file is UTF-8 encoded, the bytes in memory are UTF-8. This is why we keep stressing that strings belong in resource files rather than hard-coded in .h/.cpp files.

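Another advantage of UTF-16 is that common characters fit in two bytes, that is, in a single wchar_t (on the Windows platform). On Windows, wchar_t is therefore especially convenient for storing strings: for common characters, one wchar_t represents one character, which makes strings easy to work with. Characters outside the Basic Multilingual Plane are the exception and take two wchar_t (a surrogate pair).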
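
As a rough illustration of how code points map to UTF-16 code units, here is a minimal sketch. It uses the C++11 type char16_t rather than wchar_t so that the unit size is the same on every platform, which assumes a compiler that supports char16_t. The name encode_utf16 is made up for this example.

```cpp
#include <string>

// Hypothetical helper: encode one code point as UTF-16 code units.
// Assumes cp is a valid Unicode scalar value
// (not a surrogate, and not above U+10FFFF).
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit holds the code point.
        out += static_cast<char16_t>(cp);
    } else {
        // Supplementary planes: split the 20-bit offset into a surrogate pair.
        char32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    }
    return out;
}
```

The character 中 (U+4E2D) comes out as the single unit 0x4E2D, while U+1F600 comes out as the pair 0xD83D 0xDE00.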

UTF-16 disadvantages

There is no single, portable character type for UTF-16. The definition of wchar_t in C++98/03 is very loose, so wchar_t is 2 bytes on Windows and typically 4 bytes on Unix-like systems. This may cause difficulties when porting code (I have not ported any myself, so I am not sure what the difficulties are or how serious they would be).

Although C++11 defines char16_t to represent UTF-16, MS Visual Studio 2013 does not support it, so for now char16_t is not portable.

As far as I know, UTF-16 has a sorting disadvantage compared with GBK. If you want to sort Chinese characters alphabetically by pinyin, sorting GBK-encoded strings gives the correct result, while sorting UTF-16-encoded strings does not (I have not had this requirement so far, so I have not verified it, but it looks like I will soon, and I will verify it then).

When a UTF-16 string is transmitted over a network, byte order (endianness) has to be taken into account.
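
One common way to deal with this is to fix the byte order on the wire and convert explicitly on both sides. The sketch below assumes a protocol that sends UTF-16 in big-endian order; the function names to_utf16be and from_utf16be are invented for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: serialize UTF-16 code units as big-endian bytes.
// Shifting and masking gives the same bytes on any host, regardless of
// the host's own endianness.
std::vector<std::uint8_t> to_utf16be(const std::u16string& s) {
    std::vector<std::uint8_t> bytes;
    bytes.reserve(s.size() * 2);
    for (char16_t unit : s) {
        bytes.push_back(static_cast<std::uint8_t>(unit >> 8));    // high byte first
        bytes.push_back(static_cast<std::uint8_t>(unit & 0xFF));  // then low byte
    }
    return bytes;
}

// Hypothetical helper: rebuild code units from big-endian bytes.
// A trailing odd byte, if any, is ignored.
std::u16string from_utf16be(const std::vector<std::uint8_t>& bytes) {
    std::u16string s;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        s += static_cast<char16_t>((bytes[i] << 8) | bytes[i + 1]);
    }
    return s;
}
```

With this approach the code unit 0x4E2D is always sent as the bytes 4E 2D. Copying the in-memory representation directly would send 2D 4E from a little-endian machine such as a Windows PC.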

UTF-32 advantages

The advantage here is obvious: every character takes 4 bytes, a fixed length. One wchar_t (on Unix-like systems) represents one character.
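
A small sketch shows what the fixed length buys: once text is decoded into UTF-32, the subscript operator addresses whole code points. The helper decode_utf8 is a made-up name, it uses the C++11 type char32_t, and it assumes well-formed UTF-8 input without validating it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode well-formed UTF-8 into UTF-32,
// producing exactly one char32_t per code point.
// No validation is performed; malformed input gives meaningless output.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t extra;  // number of continuation bytes that follow
        char32_t cp;        // payload bits collected so far
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (; extra > 0 && i < in.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        out += cp;
    }
    return out;
}
```

After std::u32string s = decode_utf8("A\xE4\xB8\xAD\xE6\x96\x87"), s.size() is 3 and s[1] is 0x4E2D, so the second character is reachable directly by index.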

UTF-32 disadvantages

For English strings, the space cost is high: four bytes per character instead of one.

UTF-32 faces the same problems as UTF-16 above: no portable character type, sorting, and byte order in network transmission. char32_t is not yet supported by VS2013 (and even the VS 14 CTP does not plan to support it).

Summary

UTF-8 is the best encoding for transmitting strings over a network. UTF-16 is the best encoding for local strings. If the network protocol specifies the byte order, UTF-16 is also well suited to network transmission, especially for Chinese and other East Asian character sets, where it saves traffic compared with UTF-8. I have no particular need for UTF-32, so I do not use it for now.
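
The traffic trade-off can be estimated with a short sketch that totals the bytes a piece of text would need in each encoding. The struct EncodedSizes and the function encoded_sizes are hypothetical names, and the input is assumed to contain only valid code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: total bytes needed in each encoding.
struct EncodedSizes {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// Hypothetical helper: compute how many bytes a sequence of code points
// occupies in UTF-8, UTF-16, and UTF-32 (ignoring any byte order mark).
// Assumes every element is a valid Unicode scalar value.
EncodedSizes encoded_sizes(const std::u32string& text) {
    EncodedSizes sizes{0, 0, 0};
    for (char32_t cp : text) {
        sizes.utf8  += cp <= 0x7F ? 1 : cp <= 0x7FF ? 2 : cp <= 0xFFFF ? 3 : 4;
        sizes.utf16 += cp <= 0xFFFF ? 2 : 4;  // surrogate pair above the BMP
        sizes.utf32 += 4;                     // always one 4-byte unit
    }
    return sizes;
}
```

For two common Chinese characters this gives 6 bytes in UTF-8, 4 in UTF-16, and 8 in UTF-32; for the English word "hello" it gives 5, 10, and 20, so which encoding saves traffic depends on the text being sent.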

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
