Notes on character encoding

Source: Internet
Author: User

ASCII defines the 128 characters needed to display English text (letters, digits, punctuation marks, and control codes). Each character is represented by one byte, i.e. single-byte encoding (SBCS, char), and the ASCII code table defines the mapping between numeric values and characters.
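As a small illustration of the one-byte-per-character mapping, here is a minimal Python sketch. The sample string and variable names are chosen only for this example.

```python
# ASCII assigns each character a value in the range 0..127,
# so every ASCII character fits in a single byte.
text = "Hello, 123!"
encoded = text.encode("ascii")

print(list(encoded))                # one integer per character
print(len(encoded) == len(text))    # one byte per character

# The code table maps in both directions.
print(ord("A"), chr(65))

# A character outside the table cannot be encoded as ASCII.
try:
    "é".encode("ascii")
    outside_ascii_failed = False
except UnicodeEncodeError:
    outside_ascii_failed = True
print(outside_ascii_failed)
```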

ANSI encodings keep ASCII unchanged and extend it by using two (or more) bytes to represent a single character, i.e. multibyte encoding (MBCS, char). "ANSI" is a general term: the actual encoding is only determined in combination with a specific code page (codepage, also called a coded character set). Different countries and regions have developed their own standards for mapping values to characters, and each standard is defined by its own code page. Each language version of the operating system uses its own code page: on a Chinese operating system ANSI means GB2312, and on a Japanese operating system ANSI means Shift_JIS. GBK and GB2312 are two code pages that define Chinese character encoding; GBK is an extension of GB2312 and contains the GB2312 character set in its entirety.
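The dependence on a code page can be seen in a short Python sketch. It assumes Python's built-in codec names ("gb2312", "gbk", "latin-1") as stand-ins for the code pages discussed above; the helper can_encode is made up for this example.

```python
# Two bytes with the high bit set: not valid ASCII, so their meaning
# depends entirely on which code page is used to interpret them.
raw = bytes([0xC4, 0xE3])

as_gb2312 = raw.decode("gb2312")    # Simplified Chinese: one two-byte character
as_latin1 = raw.decode("latin-1")   # Western European: two one-byte characters
print(as_gb2312, as_latin1)

# Plain ASCII is preserved unchanged by these code pages.
ascii_kept = "A".encode("gbk") == "A".encode("ascii") == b"A"
print(ascii_kept)


def can_encode(ch: str, codec: str) -> bool:
    """Return True if `ch` has a mapping in the given codec (hypothetical helper)."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# A character that GBK covers but GB2312 does not.
rare = "镕"
print(can_encode(rare, "gb2312"), can_encode(rare, "gbk"))
```

The same two bytes come out as one Chinese character under GB2312 and as two accented Latin letters under ISO-8859-1, which is why the code page must be known before the bytes mean anything.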

Unicode is a unified encoding of all the languages in the world. It has two specifications, UCS-2 and UCS-4; UCS-2 is the one generally used, and it specifies that every character takes two bytes, i.e. a fixed-width two-byte encoding (wide characters, WCHAR). UTF-8 and UTF-16 are two different Unicode implementations for storage and transmission. UTF-8 uses one byte for English letters and two to four bytes for characters in other languages. UTF-16 uses two bytes for most characters (English letters and other scripts alike), in which case the encoded value equals the Unicode code point, and four bytes (a surrogate pair) for characters outside the Basic Multilingual Plane. The BOM (byte order mark) is a few identifying bytes at the start of a text file that describe the encoding: the UTF-8 BOM is 0xEF 0xBB 0xBF, the UTF-16LE (little endian) BOM is 0xFF 0xFE, and the UTF-16BE (big endian) BOM is 0xFE 0xFF.
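To make the byte counts and BOM values concrete, here is a minimal Python sketch using only the standard library. The helper name byte_lengths and the sample characters are assumptions made for this example.

```python
import codecs


def byte_lengths(ch: str) -> tuple:
    """Return (UTF-8 length, UTF-16 length) in bytes for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# An English letter, a Chinese character, and a character outside the
# Basic Multilingual Plane (which needs a surrogate pair in UTF-16).
for ch in ["A", "中", "😀"]:
    print(f"U+{ord(ch):04X}", byte_lengths(ch))

# BOM constants shipped with the standard library.
print(codecs.BOM_UTF8.hex(), codecs.BOM_UTF16_LE.hex(), codecs.BOM_UTF16_BE.hex())

# The "utf-8-sig" codec writes the UTF-8 BOM in front of the text.
with_bom = "A".encode("utf-8-sig")
print(with_bom)
```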

UTF-8 and UTF-16 are actually concepts at the same level as ANSI: they are all character encodings. The difference is that UTF-8 and UTF-16 use the single code page defined by Unicode, which accommodates all the languages of the world, so each of these names identifies one specific encoding, whereas ANSI must be combined with a specific code page before the encoding is determined. This is why, when switching a file's encoding format, we usually see UTF-8 listed alongside GB2312, ISO-8859-1, and other code pages.

A code page defines not only the mapping between values and local characters but also the mapping between those values and Unicode characters, so different ANSI encodings (code pages) can be converted to each other through Unicode. However, conversion is only meaningful between code pages for the same language; converting between a local code page and one for a different language is meaningless, because both the displayed text and its semantics come out wrong. A code page can be identified by a numeric ID or by a string name. Note that IDs and names are not uniformly defined across the industry: for example, the gb2312 character set is named gb2312 in VS, while the corresponding name in iconv is cp936.
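Conversion through Unicode can be sketched in a few lines of Python. The function name transcode is hypothetical, and Python's codec names stand in for code pages here; they are not the names used by VS or iconv.

```python
import codecs


def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert bytes from code page `src` to code page `dst` via Unicode."""
    text = data.decode(src)      # code page -> Unicode
    return text.encode(dst)      # Unicode -> code page


gbk_bytes = "你好".encode("gbk")

# Chinese code page -> UTF-8, and -> another Chinese code page (Big5).
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
big5_bytes = transcode(gbk_bytes, "gbk", "big5")
round_trip = transcode(big5_bytes, "big5", "gbk")
print(round_trip == gbk_bytes)

# Converting to a code page for a different language has no valid mapping.
try:
    transcode(gbk_bytes, "gbk", "latin-1")
    cross_language_ok = True
except UnicodeEncodeError:
    cross_language_ok = False
print(cross_language_ok)

# Names and numeric IDs vary by tool; in Python several aliases
# resolve to the same codec.
print(codecs.lookup("cp936").name, codecs.lookup("936").name)
```

Python raises an error in the cross-language case; tools that substitute a placeholder character instead produce the wrong text that the paragraph above describes.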

The Windows platform API comes in two sets, a Unicode version and an ANSI version. If you use the Unicode version, you must make sure the strings you pass in are Unicode encoded. If you use the ANSI version, you must make sure the string encoding matches the operating system's default character set; on a Simplified Chinese operating system, for example, the string must be GB2312 encoded (UTF-8 will not work), otherwise Chinese characters are displayed as garbled text. This mismatch is the root cause of garbled text. When you open a text file in a text editor, the editor infers the file's encoding from information in the file itself, then converts the content to the operating system's default character set for display. If the editor guesses wrong, the text may appear garbled, and you can manually specify which encoding to use for the content.
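The guessing step and the resulting garbled text can be sketched in Python. This is only an illustrative heuristic, not how any particular editor works: the function guess_encoding, its order of checks, and the "gbk" fallback (standing in for a Simplified Chinese system default) are all assumptions, and UTF-32 BOMs are ignored for brevity.

```python
import codecs


def guess_encoding(data: bytes, fallback: str = "gbk") -> str:
    """Guess a text encoding from a BOM, then UTF-8 validity, then a fallback."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"       # this codec strips the BOM while decoding
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"          # this codec reads the BOM to pick the byte order
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback          # assumed system default code page


text = "你好"
utf8_bytes = text.encode("utf-8")

# Wrong assumption: UTF-8 bytes interpreted with the GBK code page.
garbled = utf8_bytes.decode("gbk", errors="replace")
print(garbled)

# Better guess: the same bytes interpreted with the detected encoding.
correct = utf8_bytes.decode(guess_encoding(utf8_bytes))
print(correct)
```

Decoding the bytes with the wrong code page does not usually fail; it silently yields different characters, which is what the user sees as garbled text.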

Visual Studio saves and processes source files in ANSI encoding by default, so when a UTF-8 encoded source file is opened in VS, Chinese text appears garbled; this can be resolved by changing the VS default encoding. Note that even if you change the VS default encoding to UTF-8, string constants defined in the code are still encoded in the operating system's default character set when the program executes.

Finally, note that the Simplified Chinese version of Windows uses the gb2312 character set by default, while Linux and other Unix-like systems (Android, macOS, iOS, etc.) use UTF-8 by default, which means they have broader support for text in the languages of different regions.
