A few words about character encoding

Source: Internet
Author: User

ASCII is the encoding for the 128 characters needed to display English text (letters, digits, punctuation marks, and control characters). Each character is stored in one byte, so ASCII is a single-byte encoding (SBCS, CHAR), and the ASCII code table defines the mapping between numeric values and characters.
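As a quick check of the one-byte-per-character idea, here is a minimal Python sketch. The helper name `is_ascii` is made up for this illustration and does not come from the original text.

```python
def is_ascii(text: str) -> bool:
    """Return True if every character of text is in the ASCII table."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


sample = "Hello, 123!"
encoded = sample.encode("ascii")

# One byte per character, and every byte value is below 128.
assert len(encoded) == len(sample)
assert max(encoded) < 128

# The code table maps values to characters in both directions.
print(ord("A"), chr(97))
print(is_ascii(sample), is_ascii("\u4e2d\u6587"))  # the second string is Chinese text
```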

While keeping ASCII unchanged, ANSI extends it by using two (or more) bytes to represent a character, that is, multibyte encoding (MBCS, CHAR). "ANSI" is a general term; the actual encoding is determined only in combination with a specific code page (codepage, also called a coded character set). Different countries and regions have different standards for mapping numeric values to characters, and each defines its mapping in its own code page. Which code page "ANSI" refers to therefore depends on the language of the operating system: on a Simplified Chinese system ANSI means GB2312 (strictly speaking code page 936, which is GBK), and on a Japanese system it means Shift-JIS (code page 932). GB2312 and GBK are the two code pages that define Chinese character encoding; GBK is an extension of GB2312 and contains the GB2312 character set in its entirety.
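The relationship between code pages can be seen with a small Python sketch. It uses Python's built-in codec names (`gbk`, `gb2312`, `shift_jis`), which are Python's own labels and may differ from the names other tools use; the sample characters are chosen only for illustration.

```python
text = "A中"

gbk = text.encode("gbk")
gb2312 = text.encode("gb2312")
sjis = text.encode("shift_jis")

# The ASCII letter keeps its single byte 0x41; the Chinese character takes two bytes.
assert gbk == b"A\xd6\xd0"
# A character that exists in GB2312 has the same bytes in GBK.
assert gb2312 == gbk
# The same character has different bytes in the Japanese code page.
assert sjis != gbk and len(sjis) == 3

# The traditional character below is in GBK but not in GB2312.
rare = "龍"
assert len(rare.encode("gbk")) == 2
try:
    rare.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

print(in_gb2312)
```

The same bytes mean different things under different code pages, which is why the code page has to be known before "ANSI" text can be read.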

Unicode unifies the text of all the world's languages in a single character set. It has two fixed-width forms, UCS-2 and UCS-4. UCS-2 is the one generally used; it represents every character in two bytes (wide characters, WCHAR), which limits it to the 65,536 code points of the Basic Multilingual Plane (BMP). UTF-8 and UTF-16 are two different ways of storing and transmitting Unicode text. UTF-8 uses one byte for English letters and two to four bytes for characters of other languages. UTF-16 uses two bytes for most characters (English letters and other scripts alike), where the two-byte value equals the Unicode code point, and four bytes (a surrogate pair) for characters outside the BMP. A BOM (byte order mark) is a short sequence of identifying bytes at the beginning of a text file that indicates the encoding: the UTF-8 BOM is 0xEF 0xBB 0xBF, the UTF-16LE (little endian) BOM is 0xFF 0xFE, and the UTF-16BE (big endian) BOM is 0xFE 0xFF.
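The byte counts and BOM values can be checked with the Python standard library. This is only a sketch: `sniff_bom` is a hypothetical helper written for this example, and it deliberately ignores UTF-32, whose little-endian BOM also starts with 0xFF 0xFE.

```python
import codecs

# Bytes needed per character in UTF-8 and UTF-16.
# Note that a character outside the BMP (the emoji) needs four bytes in both.
sizes = {}
for ch in ["A", "é", "中", "😀"]:
    sizes[ch] = (len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))
    print(ascii(ch), sizes[ch])

# The BOM values quoted in the text, as exposed by the codecs module.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

A file without a BOM is still perfectly valid; in that case the encoding has to be known or guessed some other way.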

UTF-8 and UTF-16 are at the same conceptual level as ANSI: all of them are character encoding methods. The difference is that UTF-8 and UTF-16 rely on the single code page defined by Unicode, which can hold every language in the world, so the name UTF-8 or UTF-16 alone identifies a definite encoding, whereas ANSI must be combined with a specific code page before the encoding is determined. This is why, when switching a file's encoding, UTF-8 usually appears in the same list, side by side, with code pages such as GB2312 and ISO-8859-1. Unicode and ASCII can each be viewed as a code page that defines a mapping between values and characters. The ASCII code table is compatible with UTF-8 and with the common ANSI code pages (UTF-16 is an exception, since it spends two bytes even on an English letter), which means that English content is displayed correctly under almost any of these encodings.
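The ASCII compatibility mentioned above can be verified with a short Python sketch. The list of encodings is just a sample picked for this example, and UTF-16 is included as a counterexample because it does not store English letters in one byte.

```python
english = "Hello"
ascii_bytes = english.encode("ascii")

# Encodings that keep ASCII characters as the same single bytes.
compatible = ["utf-8", "gbk", "gb2312", "iso-8859-1"]
same = {name: english.encode(name) == ascii_bytes for name in compatible}
print(same)

# UTF-16 uses two bytes per letter, so the bytes differ from ASCII.
utf16 = english.encode("utf-16-le")
print(len(ascii_bytes), len(utf16))
```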

A code page defines not only the mapping between numeric values and local characters but also the mapping between those values and Unicode characters, so text in different ANSI encodings (code pages) can be converted from one to another through Unicode. In practice, however, such conversion is usually meaningful only between UTF-8 (or another Unicode encoding) and a local code page. Converting between a local code page and the code page of another language makes little sense, because most characters of one have no counterpart in the other, so both the displayed text and its meaning come out wrong. A code page can be identified by a numeric ID or by a string name. Note that IDs and names are not standardized across the industry: for example, the GB2312 character set is named gb2312 in VS, while the corresponding name in iconv is cp936.
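Conversion through Unicode can be sketched in a few lines of Python, where `str` plays the role of the Unicode intermediate form. The function name `transcode` is hypothetical, and the codec names are the ones Python happens to use, which illustrates the naming problem as well.

```python
import codecs


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one encoding to another by going through Unicode."""
    return data.decode(source).encode(target)


gbk_bytes = "中文".encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")

# Different bytes, same text once each is decoded with its own encoding.
assert gbk_bytes != utf8_bytes
assert utf8_bytes.decode("utf-8") == gbk_bytes.decode("gbk")

# Names are tool-specific: Python treats "cp936" as an alias of its "gbk" codec.
print(codecs.lookup("cp936").name)
```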

The Windows platform API comes in two sets, a Unicode version and an ANSI version. When you call the Unicode version, the string passed in must be Unicode encoded; when you call the ANSI version, the encoding of the string passed in must match the operating system's default character set. On a Simplified Chinese system, for example, the string must be GB2312 encoded (UTF-8 will not work), otherwise Chinese characters are displayed as garbled text. A mismatch between the encoding the bytes were written in and the encoding they are read with is the root cause of garbled text. When a text editor opens a file, it infers the file's encoding from information in the file itself and then converts the content to the operating system's default character set for display. If the editor guesses wrong, the text may appear garbled, in which case you can manually specify which encoding to use for the content.
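The garbling described here can be reproduced without any Windows API, simply by decoding bytes with the wrong encoding. The following Python sketch assumes a Simplified Chinese system whose ANSI code page is GBK, and only imitates what a reader expecting that code page would see.

```python
text = "中文"
utf8_bytes = text.encode("utf-8")

# Correct: decode with the encoding that produced the bytes.
assert utf8_bytes.decode("utf-8") == text

# Wrong: interpret the UTF-8 bytes as GBK, as a reader that assumes
# the system code page would do. errors="replace" avoids an exception
# if some byte pair is not valid GBK.
garbled = utf8_bytes.decode("gbk", errors="replace")
assert garbled != text
print(ascii(garbled))

# The fix is to convert to the encoding the consumer expects.
ansi_bytes = text.encode("gbk")
assert ansi_bytes.decode("gbk") == text
```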

By default, Visual Studio saves and processes source files in ANSI encoding, so a UTF-8 encoded source file opened in VS may appear garbled; this can be resolved by changing the VS default encoding. Note that even if the VS default encoding is changed to UTF-8, string constants defined in the code are, by default, still encoded in the operating system's default character set when the program runs, not in the encoding used when the source file was saved. For example, if the code defines a string constant containing Chinese text, such as "中国" ("China"), then even when the source file is saved as UTF-8 the constant still uses the operating system's default character set at run time, which can be confirmed by writing the constant to a log file. This is not hard to understand: the string constants a program uses at run time come from the PE file, not from the source file. The compiler is free to convert constant strings during compilation (from the source file's encoding to the operating system's default encoding) before writing them to the obj file, which is eventually linked into the PE file.

Finally, note that Simplified Chinese Windows uses the GB2312 character set by default, while Linux and other Unix-like systems (Android, macOS, iOS, and so on) use UTF-8 by default, which means those systems can handle text in the languages of many different regions without switching code pages.
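Because the default differs between systems, it is often safer not to rely on it. The Python sketch below prints what the current system reports and then reads and writes a file with an explicit encoding; the temporary file and sample text exist only for this illustration.

```python
import locale
import os
import sys
import tempfile

# What this system uses when no encoding is given (varies by OS and settings).
print(locale.getpreferredencoding(False))
# Python's own default for str.encode() and bytes.decode().
print(sys.getdefaultencoding())

text = "中文 and English"
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Naming the encoding explicitly gives the same bytes on every platform.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

assert round_trip == text
```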
