GBK, UTF-16 and UTF-8 for Chinese character encoding

Source: Internet
Author: User

In programming, text frequently has to be converted among these three character encodings. Many third-party libraries appear to fail for no obvious reason when, in fact, the cause is that the library expects UTF-8 while Windows defaults to UTF-16.

The following describes these three character encodings, all of which are common on Windows.

GB2312

GB2312 is China's national standard character set for Simplified Chinese. It represents each Chinese character as a 16-bit (two-byte) unit, so one Chinese character occupies two char elements.

The default character set of the Simplified Chinese edition of Microsoft Windows supports GB2312 (the system code page, GBK, is a superset of it). This is why the following code displays Chinese characters correctly:

const char *pchar = "中文";

printf("%s", pchar);
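To make the "two char units per Chinese character" point concrete, here is a minimal sketch that counts characters in a GB2312 byte string. The function name gb2312_char_count is made up for this illustration, and the rule it uses (any byte of 0x80 or above starts a two-byte character) is a simplifying assumption that ignores validation of the byte ranges.

```cpp
#include <cassert>
#include <cstddef>

// Count characters in a NUL-terminated GB2312 byte string.
// Assumption: a byte >= 0x80 starts a two-byte character; anything else is ASCII.
inline std::size_t gb2312_char_count(const char *s)
{
    assert(s != nullptr);
    std::size_t count = 0;
    while (*s != '\0') {
        const unsigned char lead = static_cast<unsigned char>(*s);
        // Skip the trail byte of a two-byte character, unless the string is truncated.
        s += (lead >= 0x80 && s[1] != '\0') ? 2 : 1;
        ++count;
    }
    return count;
}
```

For example, the two characters "中文" are stored in GB2312 as the four bytes D6 D0 CE C4, so strlen reports 4 while this function reports 2.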

However, if the program has to handle text that is neither Chinese nor standard ASCII (for example, Korean), garbled characters appear, because GB2312 cannot represent Korean.

This is why Microsoft recommends Unicode. Because Unicode aims to cover every writing system in use, it can in principle represent any text.

Unicode

The Unicode character set is kept synchronized with the international standard ISO 10646. The Unicode Standard is published by the Unicode Consortium.

ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards. ISO 10646 originally defined a 31-bit character set; the code space has since been restricted to U+0000 through U+10FFFF. At first, only the first 65,534 code positions (0x0000 to 0xFFFD) of this huge code space were allocated. This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP).
Characters outside the 16-bit BMP were originally expected to be specialist characters (such as hieroglyphics) used mainly by experts in history and science. Today the planes beyond the BMP also contain emoji and less common CJK ideographs, so such characters do appear in everyday text.

UTF-16

The character encoding table only assigns an integer to each character. There are several ways to represent a string of characters as a string of bytes. The two most obvious are to store Unicode text as sequences of 2-byte or 4-byte units. The formal names of these two encodings are UCS-2 and UCS-4, respectively. The Unicode representation used by Windows is very close to UCS-2, that is, two bytes represent one Unicode character.

Unless otherwise specified, the bytes follow the big-endian convention. To convert an ASCII or Latin-1 file to UCS-2, simply insert a 0x00 byte before each ASCII byte. To convert to UCS-4, insert three 0x00 bytes before each ASCII byte.
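The zero-padding rule can be sketched in a few lines. The helper name latin1_to_ucs_be and its unit_size parameter are invented for this example; it assumes big-endian output and input that really is ASCII or Latin-1.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Convert ASCII/Latin-1 bytes to big-endian UCS-2 or UCS-4 by zero-padding.
// unit_size is 2 for UCS-2 and 4 for UCS-4.
inline std::vector<unsigned char> latin1_to_ucs_be(const std::string &text, int unit_size)
{
    assert(unit_size == 2 || unit_size == 4);
    std::vector<unsigned char> out;
    out.reserve(text.size() * static_cast<std::size_t>(unit_size));
    for (char ch : text) {
        // Leading zero bytes first, then the original byte (big-endian order).
        for (int i = 0; i < unit_size - 1; ++i) {
            out.push_back(0x00);
        }
        out.push_back(static_cast<unsigned char>(ch));
    }
    return out;
}
```

With unit_size set to 2, the text "Hi" becomes the bytes 00 48 00 69.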

Windows originally used UCS-2 internally and now implements it as UTF-16. Characters defined in the Basic Multilingual Plane (BMP), also called plane 0, are expressed in 2 bytes; characters outside the BMP are expressed as a pair of 2-byte units (a surrogate pair).
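UTF-16 stores each BMP character as one 16-bit unit and each code point above U+FFFF as two units, called a surrogate pair. The following is a rough sketch of that rule; encode_utf16 is a name chosen for this example, not a Windows or standard library function.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
inline std::vector<std::uint16_t> encode_utf16(std::uint32_t cp)
{
    // Precondition: a valid scalar value (not a surrogate, not above U+10FFFF).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        // BMP character: a single 16-bit unit.
        return { static_cast<std::uint16_t>(cp) };
    }
    // Supplementary character: split the 20-bit offset into two 10-bit halves.
    const std::uint32_t v = cp - 0x10000;
    return {
        static_cast<std::uint16_t>(0xD800 + (v >> 10)),
        static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))
    };
}
```

For instance, U+4E2D stays as the single unit 0x4E2D, while U+1F600 becomes the pair 0xD83D 0xDE00.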

Therefore, wchar_t on Windows is 2 bytes wide, and even an ASCII character occupies two bytes.

Java strings also use UTF-16.

UTF-8

Using UCS-2 (or UCS-4) under UNIX would cause very serious problems. Strings in these encodings can contain bytes such as '\0' or '/' as parts of wide characters, and these bytes have a special meaning in file names and other C library function parameters. In addition, most UNIX tools expect ASCII files and cannot read 16-bit characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode in file names, text files, environment variables, and similar places.

The UTF-8 encoding, defined in ISO 10646-1 Annex R and in RFC 2279 (since replaced by RFC 3629), does not have these problems. It is the obvious way to use Unicode on Unix-style operating systems.

UTF-8 has the following properties:

  • UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatible). This means that files containing only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
  • All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, an ASCII byte (0x00-0x7F) can never appear as part of any other character.
  • The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and indicates how many bytes the character contains. The remaining bytes of the sequence are in the range 0x80 to 0xBF. This makes resynchronization easy, makes the encoding stateless, and makes it robust against missing bytes.
  • All possible 2^31 UCS codes can be encoded.
  • In theory, UTF-8 encoded characters can be up to 6 bytes long, but 16-bit BMP characters are at most 3 bytes long. (RFC 3629 later limited UTF-8 to 4 bytes, which is enough for U+10FFFF.)
  • The sorting order of big-endian UCS-4 byte strings is preserved.
  • Bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
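The byte layout described in this list can be sketched as a small encoder. The function name encode_utf8 is chosen for this illustration, and the sketch covers only the 1- to 4-byte forms allowed by RFC 3629, not the historical 5- and 6-byte forms.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one code point as UTF-8 (1 to 4 bytes, the range allowed by RFC 3629).
inline std::string encode_utf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {
        // ASCII: stored unchanged in a single byte.
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        // The rest of the BMP: three bytes.
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        // Beyond the BMP: four bytes.
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

As a check, U+0041 encodes to the single byte 41, U+4E2D to E4 B8 AD, and U+1F600 to F0 9F 98 80.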

Note that in a multibyte sequence, the number of leading 1 bits in the first byte equals the number of bytes in the whole sequence.
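This leading-bits rule can be written as a short helper. The name utf8_sequence_length is hypothetical, and the sketch accepts lengths up to 6 to match the original scheme described above, even though current UTF-8 stops at 4.

```cpp
// Return the length of a UTF-8 sequence given its first byte:
// 1 for ASCII, N for a lead byte with N leading 1 bits (2 to 6),
// 0 for a continuation byte or a byte that never appears in UTF-8.
inline int utf8_sequence_length(unsigned char first)
{
    if (first < 0x80) {
        return 1;
    }
    int ones = 0;
    for (unsigned char mask = 0x80; mask != 0 && (first & mask) != 0; mask >>= 1) {
        ++ones;
    }
    return (ones >= 2 && ones <= 6) ? ones : 0;
}
```

A reader positioned at an arbitrary byte can skip bytes for which this returns 0 until it finds the start of the next character, which is the resynchronization property mentioned earlier.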

UTF-8 is widely used in web protocols and UNIX operating systems. ASCII characters are left unchanged, and all other characters use a variable-length encoding: each BMP character takes 1 to 3 bytes, and characters outside the BMP take 4 bytes.

With the three encodings understood, let's look at how to convert between them.

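// Convert a wchar_t (UTF-16) string to UTF-8 using the Windows API.
#include <windows.h>
#include <string>

inline std::string convertwchar2utf8(const wchar_t *a_szsrc)
{
    // The first call returns the required size in bytes, including the terminating NUL.
    const int nszbuffer = WideCharToMultiByte(CP_UTF8, 0, a_szsrc, -1, NULL, 0, NULL, NULL);
    if (nszbuffer <= 0)
        return std::string();
    char *buffer = new char[nszbuffer];
    WideCharToMultiByte(CP_UTF8, 0, a_szsrc, -1, buffer, nszbuffer, NULL, NULL);
    std::string strreturn = buffer;
    delete[] buffer;
    return strreturn;
}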
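For the reverse direction, Windows provides MultiByteToWideChar. As a portable illustration of what such a conversion has to do, here is a hand-written sketch that does not depend on the Windows API. The name utf8_to_utf16 is made up for this example, and the error handling is an assumption: anything malformed is replaced with U+FFFD, and overlong forms are not rejected.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode UTF-8 into UTF-16 code units. Malformed or truncated sequences
// become U+FFFD; overlong encodings are not detected in this sketch.
inline std::u16string utf8_to_utf16(const std::string &in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0xFFFD;
        std::size_t extra = 0;  // continuation bytes still expected
        if (b < 0x80)                { cp = b; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        ++i;
        for (; extra > 0 && i < in.size(); --extra, ++i) {
            const unsigned char c = static_cast<unsigned char>(in[i]);
            if ((c & 0xC0) != 0x80) break;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        // Incomplete sequence, surrogate value, or out-of-range result.
        if (extra > 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // emit a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The result uses std::u16string rather than wchar_t so that it behaves the same on platforms where wchar_t is 4 bytes wide; on Windows the two have the same 16-bit unit size.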

 
