A detailed explanation of Unicode, UTF-8, and ANSI character encodings

Source: Internet
Author: User
Tags: character set, coding standards

ANSI, UTF-8, Unicode

ANSI, UTF-8, and Unicode are three encoding formats for character codes. A given character can be encoded in ANSI, UTF-8, or Unicode format; the three differ only in their byte representation and carry the same content, as shown in the following table:

Char        ANSI (GBK)  Unicode  UTF-8
中 (zhong)  0xd6d0      0x4e2d   0xe4b8ad
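The values in the table can be reproduced with a short Python sketch. It assumes that Python's built-in `gbk` codec stands in for the ANSI code page of a Simplified Chinese system; the variable names are illustrative only.

```python
# Encode the character 中 in each of the three formats from the table.
ch = "中"

code_point = ord(ch)             # Unicode code point (a number, not bytes)
gbk_bytes = ch.encode("gbk")     # ANSI form on a Simplified Chinese system
utf8_bytes = ch.encode("utf-8")  # UTF-8 form

print(f"Unicode: 0x{code_point:04x}")
print(f"GBK:     0x{gbk_bytes.hex()}")
print(f"UTF-8:   0x{utf8_bytes.hex()}")
```

The three results describe the same character; only the stored bytes differ.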

ANSI Encoding

In an ANSI encoding, an English character takes one byte and a Chinese character takes two bytes.

To let computers support multiple languages, different countries and regions developed their own standards, producing GB2312, BIG5, JIS, and other encodings. These extended encodings, which use two bytes to represent a single character such as a Chinese character, are collectively called ANSI encodings. On a Simplified Chinese system, ANSI means GB2312 (in practice its superset GBK, Windows code page 936); on a Japanese system, ANSI means the JIS encoding (Shift-JIS, code page 932).

In an ANSI encoding, values in the range 0x00~0x7f still represent one character per byte (the ASCII encoding), and characters outside that range are usually represented by two bytes in the 0x80~0xff range. For example, in Simplified Chinese the character 中 ("middle") is stored as the two bytes [0xd6, 0xd0].
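As a rough illustration of the one-byte versus two-byte split, the sketch below measures how many bytes each character takes. The helper name `byte_lengths` is hypothetical, and `gbk` is again assumed to represent the Simplified Chinese ANSI encoding.

```python
def byte_lengths(text, codec="gbk"):
    """Return (character, encoded length in bytes) for each character."""
    return [(ch, len(ch.encode(codec))) for ch in text]

sample = "A中"
for ch, n in byte_lengths(sample):
    first = ch.encode("gbk")[0]
    # ASCII characters stay below 0x80; double-byte characters start at 0x80 or above.
    kind = "ASCII range" if first <= 0x7F else "first byte >= 0x80"
    print(f"{ch!r}: {n} byte(s), first byte 0x{first:02x} ({kind})")
```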

The following table shows how the character 文 ("text") is encoded under different ANSI standards:

Char      ANSI (GBK)  ANSI (BIG5)  ANSI (JIS)  Unicode  UTF-8
文 (wen)  0xcec4      0xa4e5       0x95b6      0x6587   0xe69687

As you can see, the different ANSI encodings are incompatible with one another. When information is exchanged internationally, text belonging to two languages cannot be stored in the same piece of ANSI-encoded text. Text in different ANSI encodings needs to be converted to UTF-8 for storage.
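The incompatibility can be seen directly in a small Python sketch. It assumes the standard codecs `gbk`, `big5`, and `shift_jis` correspond to the three ANSI columns above; the sample string `mixed` is made up for illustration.

```python
ch = "文"

# The same character produces different bytes under each ANSI encoding.
encodings = {name: ch.encode(name) for name in ("gbk", "big5", "shift_jis")}
for name, raw in encodings.items():
    print(f"{name:10} 0x{raw.hex()}")

# Reading GBK bytes with the wrong ANSI codec yields different text.
gbk_bytes = encodings["gbk"]
as_big5 = gbk_bytes.decode("big5", errors="replace")
as_sjis = gbk_bytes.decode("shift_jis", errors="replace")
print(repr(as_big5), repr(as_sjis))

# UTF-8 can hold text from several languages in one byte string.
mixed = "文 and かな"
utf8_bytes = mixed.encode("utf-8")
print(utf8_bytes.decode("utf-8") == mixed)
```

Decoding with the wrong codec does not necessarily raise an error; it can silently produce other characters, which is why the encoding must be known or the text stored as UTF-8.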

Unicode encoding

The full name of the Unicode character set encoding is Universal Multiple-Octet Coded Character Set (UCS), the name used by ISO/IEC 10646, which is kept in sync with Unicode. The Unicode character set is an encoding scheme developed by an international organization that can hold all the text and symbols in the world.

In its original two-byte form, Unicode uses the values 0x0000-0xFFFF to represent a character (the current standard extends the range to 0x10FFFF). Every character and symbol in the world corresponds to a numeric code in the Unicode character set, but:

Unicode is only a character set: it specifies the numeric code of each symbol but does not specify how that code should be stored as bytes.

The advantage of Unicode is that it covers all the text and symbols in the world. The drawback is that, stored as two bytes, it wastes a byte for each English character. For example, the English letter A is represented as 0x0041 in Unicode.
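To make the distinction between a code and its storage concrete, here is a minimal Python sketch. The codec names are Python's own; the dictionary name `forms` is hypothetical.

```python
ch = "A"
print(f"code point: 0x{ord(ch):04x}")

# One code point, several possible byte-level storage forms.
forms = {
    codec: ch.encode(codec)
    for codec in ("utf-16-be", "utf-16-le", "utf-32-be", "utf-8")
}
for codec, raw in forms.items():
    print(f"{codec:10} {raw.hex(' ')}")
```

The two-byte forms carry a zero byte alongside 0x41, which is the waste described above; UTF-8 stores the same letter in a single byte.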

UTF-8 Encoding

UTF-8 is one of the ways Unicode is implemented.

The full name of UTF-8 is 8-bit Unicode Transformation Format. UTF-8 is a variable-length character encoding for Unicode. It can represent any character in the Unicode Standard, and its single-byte encodings remain compatible with ASCII.

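UTF-8 is a variable-length encoding. As originally defined, it encodes a character using 1~6 bytes; the current standard (RFC 3629) allows at most 4 bytes, because Unicode code points end at 0x10FFFF. The rules are as follows: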

1. For a single-byte symbol, the first bit of the byte is set to 0 and the following 7 bits hold the Unicode code of the symbol. So for English letters, the UTF-8 encoding and the ASCII code are the same.

2. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of each following byte are set to 10. All the remaining bits hold the Unicode code of the symbol.

n  Unicode symbol range  UTF-8 encoding method

1  0000 0000-0000 007F   0xxxxxxx
2  0000 0080-0000 07FF   110xxxxx 10xxxxxx
3  0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
4  0001 0000-0010 FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5  0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6  0400 0000-7FFF FFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Note: In UTF-8, an English character occupies one byte and a commonly used Chinese character occupies three bytes.
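The rules and table above can be sketched as a hand-written encoder. This is a minimal illustration, not a replacement for the built-in codec: the function name `utf8_encode` is hypothetical, and it covers only the 1 to 4 byte forms that are valid today, rejecting surrogate code points as the built-in codec does.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by hand, following the table (n = 1..4)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0x7F:
        return bytes([code_point])  # single byte: 0xxxxxxx
    if code_point <= 0x7FF:
        n = 2
    elif code_point <= 0xFFFF:
        n = 3
    else:
        n = 4
    # Following bytes: 10xxxxxx, filled from the low-order bits upward.
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    # First byte: n one-bits, a zero bit, then the remaining high bits.
    lead_prefix = (0xFF << (8 - n)) & 0xFF
    return bytes([lead_prefix | code_point] + tail[::-1])

for ch in ("A", "中"):
    print(ch, utf8_encode(ord(ch)).hex(), ch.encode("utf-8").hex())
```

Comparing against `str.encode("utf-8")` is a quick way to check the sketch against the standard implementation.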

Summary
1. On a Chinese operating system the default encoding is ANSI, so generated TXT files default to ANSI encoding.

2. For international documents (TXT and XML), using a Unicode encoding is the proper practice; both the operating system and the browser can "understand" Unicode. Browsers "understand" UTF-8 without difficulty, whereas the operating system sometimes recognizes only the Unicode encoding.

3. Windows Notepad has four encoding options: ANSI, Unicode, Unicode big endian, and UTF-8.

ANSI is the default encoding. For English files it is ASCII; for Simplified Chinese files it is GB2312 (only on the Simplified Chinese version of Windows; the Traditional Chinese version uses BIG5).
Unicode refers to UCS-2 (UTF-16 on current versions of Windows), which stores each character's Unicode code directly in two bytes. This option uses the little endian format.
Unicode big endian corresponds to the previous option but uses the big endian format.
UTF-8 refers to UTF-8 with a BOM.
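One way to tell these four options apart when reading a file is to look for a byte order mark (BOM). The sketch below is an assumption-laden illustration: the function name `sniff_bom` is hypothetical, it assumes Notepad writes a BOM for both Unicode options as well as for UTF-8, and it treats any file without a BOM as ANSI, which is only a guess.

```python
import codecs

def sniff_bom(data: bytes) -> str:
    """Guess the Notepad-style encoding of a byte string from its BOM."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "Unicode"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "Unicode big endian"
    return "ANSI"  # no BOM: assume the system's ANSI code page

text = "中"
print(sniff_bom(text.encode("utf-8-sig")))                          # UTF-8 with BOM
print(sniff_bom(codecs.BOM_UTF16_LE + text.encode("utf-16-le")))    # little endian
print(sniff_bom(codecs.BOM_UTF16_BE + text.encode("utf-16-be")))    # big endian
print(sniff_bom(text.encode("gbk")))                                # no BOM
```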

ANSI, UTF-8, Unicode Conversions

Windows Unicode and Character Sets
Unicode is the most widely used character encoding standard, and Windows applications use the UTF-16 implementation of the Unicode character set. Windows also supports traditional character sets: single-byte character sets (SBCS) and multibyte character sets (MBCS).

Many Windows API functions have an "A" version and a "W" version: the "A" version works with the Windows code page, and the "W" version works with Unicode characters. An application can convert between Unicode strings and Windows code page strings using the two functions WideCharToMultiByte and MultiByteToWideChar. Although the function names contain "MultiByte", these functions can actually handle SBCS, DBCS, and multibyte character set code pages.

Encoding Conversion
On the Windows platform, conversion between ANSI, UTF-8, and Unicode relies mainly on the two functions WideCharToMultiByte and MultiByteToWideChar.

Unicode to UTF-8: set the CodePage parameter of WideCharToMultiByte to CP_UTF8;
UTF-8 to Unicode: set the CodePage parameter of MultiByteToWideChar to CP_UTF8;
Unicode to ANSI: set the CodePage parameter of WideCharToMultiByte to CP_ACP;
ANSI to Unicode: set the CodePage parameter of MultiByteToWideChar to CP_ACP;
UTF-8 to ANSI: first convert UTF-8 to Unicode, then convert Unicode to ANSI;
ANSI to UTF-8: first convert ANSI to Unicode, then convert Unicode to UTF-8.
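The two-step conversions listed above always pivot through Unicode. The Python sketch below is a portable analogue of that pattern, not a call into the Win32 API: `decode` plays the role of MultiByteToWideChar and `encode` the role of WideCharToMultiByte. The function name `convert` is hypothetical, and `gbk` is assumed to stand in for CP_ACP on a Simplified Chinese system.

```python
def convert(data: bytes, src: str, dst: str) -> bytes:
    """Convert bytes between encodings by pivoting through a Unicode string."""
    text = data.decode(src)   # bytes -> Unicode (like MultiByteToWideChar)
    return text.encode(dst)   # Unicode -> bytes (like WideCharToMultiByte)

ansi = "中文".encode("gbk")             # ANSI bytes on a Simplified Chinese system
utf8 = convert(ansi, "gbk", "utf-8")    # ANSI to UTF-8
back = convert(utf8, "utf-8", "gbk")    # UTF-8 to ANSI

print(ansi.hex(), utf8.hex(), back.hex())
```

Going through Unicode in the middle means each encoding needs only one pair of conversions instead of a direct converter for every pair of encodings.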
