Knowledge about the character set

Source: Internet
Author: User

ASCII: An early character set. It is 7-bit and defines 128 characters: uppercase and lowercase letters, the digits 0-9, punctuation, and some control characters.

Extended ASCII: A byte has 8 bits, so using only 7 of them seemed wasteful. The 8th bit was used to extend the ASCII character set with 128 more characters. These upper 128 characters were used for additional Latin letters, Greek letters, and other special symbols. The problem is that many European countries have their own special letters, and 128 extra positions are clearly not enough to hold them all, so code pages appeared.

Code page: Still 1 byte per character. The first 128 characters are the same everywhere and identical to ASCII; the upper 128 characters depend on which code page the system uses, and each code page covers the letters and symbols of a particular language or group of languages.
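As a rough illustration of the code page idea, the Python sketch below decodes the same two bytes under three single-byte code pages. The byte 0xE9 and the code pages cp1252, cp1251, and cp1253 are choices made for this example and do not come from the original text.

```python
# The same byte value means different characters under different code pages.
# The byte 0xE9 and the three code pages are illustrative choices.
raw = bytes([0x41, 0xE9])  # 0x41 is in the ASCII range, 0xE9 is in the upper 128

decoded = {cp: raw.decode(cp) for cp in ("cp1252", "cp1251", "cp1253")}

for cp, text in decoded.items():
    # The first character is "A" in every case; the second depends on the code page.
    print(cp, text)
```

The ASCII half of the byte range decodes identically everywhere, while the upper half only makes sense once the code page is known.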

DBCS (double-byte character set): For Asian countries, the upper 128 characters cannot hold the thousands of ideographs these languages use, and DBCS is the solution. A DBCS represents each character with one or two bytes, so despite the name a character is not necessarily two bytes: ASCII characters such as English letters are still stored in 1 byte and remain ASCII-compatible, while a Chinese character, for example, is stored in 2 bytes. English and Chinese text can therefore be processed together. To tell whether a byte begins a 2-byte character, check its highest bit: if it is 1, the byte is a lead byte, and you must read the byte that follows it and interpret the two bytes as 1 character. GB2312 and GBK are DBCS encodings; GB18030 builds on them and adds 4-byte sequences. In addition, the "ANSI" encoding on Simplified Chinese Windows usually means GBK (code page 936).

The big problem with DBCS is that the number of characters in a string cannot be determined by the number of bytes, such as "Chinese abc", the number of characters is 5, and the number of bytes is 7. This is a nightmare for programmers who traverse strings with the + + or--operator!

Unicode: The formal name is "Universal Multiple-Octet Coded Character Set", abbreviated "UCS". UCS can also loosely be read as "Unicode Character Set". It is likewise a character set and character encoding scheme, one that defines a single character set covering the writing systems of most languages on the planet. UCS is ASCII-compatible (the first 128 characters are identical), but it is not compatible with DBCS, because all other characters were assigned new positions in UCS.
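A small Python sketch can make both compatibility claims concrete. It uses GBK as the example DBCS and the character "中" as the example ideograph; both are choices made for illustration.

```python
# ASCII compatibility: the first 128 code points match the ASCII byte values.
ascii_text = bytes(range(128)).decode("ascii")
ascii_matches = [ord(c) for c in ascii_text] == list(range(128))

# DBCS incompatibility: the same character sits at a different position.
ch = "中"
unicode_position = ord(ch)                          # code point in Unicode
gbk_position = int.from_bytes(ch.encode("gbk"), "big")  # 2-byte value in GBK

print(ascii_matches)
print(hex(unicode_position), hex(gbk_position))
```

Because the positions differ, converting between GBK and Unicode needs a lookup table; no arithmetic shortcut maps one to the other.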

UCS comes in two formats: UCS-2 and UCS-4. The former encodes each character in 2 bytes (16 bits); the latter uses 4 bytes (of which only 31 bits are actually used). The part of UCS-4 whose first 2 bytes are 0 is called the BMP (Basic Multilingual Plane); in other words, dropping the 2 leading zero bytes from a BMP character in UCS-4 gives its UCS-2 form. Early versions of the standard allocated no characters outside the BMP, so UCS-4 was at first only a reserve for when the 16-bit space of UCS-2 ran out. Since Unicode 3.1, characters have been allocated outside the BMP as well (for example, rarer CJK ideographs and, later, emoji).
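The relationship between the 4-byte and 2-byte forms can be sketched in Python. Python has no codec named UCS-4 or UCS-2, so this example assumes the big-endian UTF-32 and UTF-16 codecs as stand-ins, which give the same bytes for BMP characters. The helper is_bmp is a hypothetical name, and U+1F600 is used as an example of a code point outside the BMP in current Unicode.

```python
def is_bmp(ch: str) -> bool:
    """Return True if the character's code point fits in 16 bits."""
    return ord(ch) <= 0xFFFF


ch = "中"
four_bytes = ch.encode("utf-32-be")  # stand-in for the UCS-4 form
two_bytes = ch.encode("utf-16-be")   # stand-in for the UCS-2 form

# For a BMP character, the 4-byte form is two zero bytes plus the 2-byte form.
print(four_bytes.hex(), two_bytes.hex())
print(four_bytes[:2] == b"\x00\x00", four_bytes[2:] == two_bytes)

# A character outside the BMP has a non-zero value in the first two bytes.
outside = "\U0001F600"
print(is_bmp(outside), outside.encode("utf-32-be").hex())
```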

UTF-8, UTF-16, UTF-32: "Unicode Transformation Format" (UTF) is the transfer format for Unicode. Unicode specifies which code point each character is assigned; a UTF specifies how to map those code points to a sequence of bytes for transmission or storage.

UTF-16 and UTF-32 encode Unicode with 16-bit and 32-bit code units respectively. UTF-16 corresponds to UCS-2 and UTF-32 corresponds to UCS-4 (UCS-2 and UCS-4 are old terms that should be retired). The difference is that UTF-16 can also represent characters outside the BMP by using a pair of 16-bit units (a surrogate pair). In addition, on Windows "Unicode" as an encoding name usually means UTF-16.
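UTF-16 stores a BMP code point in one 16-bit unit and a code point outside the BMP in two units (a surrogate pair). The Python sketch below works through that arithmetic and compares the result with Python's built-in codec. The function name to_utf16_units is hypothetical, and the sketch assumes its input is a valid code point that is not itself a surrogate.

```python
def to_utf16_units(cp: int) -> list:
    """Split a code point into one or two 16-bit UTF-16 code units."""
    if cp < 0x10000:
        return [cp]                    # BMP: the unit is the code point itself
    offset = cp - 0x10000              # 20-bit value to spread over two units
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the low surrogate
    return [high, low]


def units_to_bytes(units: list) -> bytes:
    """Serialize 16-bit units in big-endian order."""
    return b"".join(u.to_bytes(2, "big") for u in units)


for ch in ("A", "中", "\U0001F600"):
    units = to_utf16_units(ord(ch))
    # The hand-built bytes should match Python's own UTF-16 encoder.
    print([hex(u) for u in units], units_to_bytes(units) == ch.encode("utf-16-be"))
```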

UTF-8 is the key! If every Unicode character were stored in 2 bytes, English text would be at a disadvantage (the high byte would always be 0). UTF-8 offers a flexible solution: a variable-length multibyte encoding whose unit is a single byte (8 bits). ASCII letters still take 1 byte, most Chinese characters take 3 bytes, and other characters take up to 4 bytes (the original design allowed sequences of up to 6 bytes, but the current standard, RFC 3629, limits UTF-8 to 4).
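The variable-length scheme can be sketched as a hand-written encoder in Python. The function name utf8_encode is hypothetical, the sketch covers the 1- to 4-byte forms used by current UTF-8, and it assumes the input is a valid code point (it does not reject surrogates or values above U+10FFFF).

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (1 to 4 bytes)."""
    if cp < 0x80:          # 7 bits:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:         # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for ch in ("A", "\u00e9", "中", "\U0001F600"):
    encoded = utf8_encode(ord(ch))
    # Compare the hand-built bytes with Python's own encoder.
    print(len(encoded), encoded.hex(), encoded == ch.encode("utf-8"))
```

The lead byte announces the sequence length through its high bits, and every continuation byte starts with the bits 10, so a decoder can always tell where a character begins.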

UTF-16 and UTF-32 use a byte order mark, or BOM (U+FEFF), to resolve the big-endian versus little-endian question. UTF-8 has no byte order problem, because its unit is 1 byte.
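The effect of byte order and the BOM can be observed with Python's standard codecs module. The character "中" is an arbitrary example; the BOM constants and codec names are the ones Python's standard library provides.

```python
import codecs

text = "中"  # code point U+4E2D

big = text.encode("utf-16-be")     # high byte first
little = text.encode("utf-16-le")  # low byte first
print(big.hex(), little.hex())

# With a BOM in front, the generic "utf-16" decoder picks the right byte order.
from_big = (codecs.BOM_UTF16_BE + big).decode("utf-16")
from_little = (codecs.BOM_UTF16_LE + little).decode("utf-16")
print(from_big == text, from_little == text)

# UTF-8 works byte by byte, so there is only one possible byte sequence.
utf8 = text.encode("utf-8")
print(utf8.hex())
```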
