Character Encoding Knowledge Summary

Source: Internet
Author: User
Tags ranges

 

The earliest encoding is ASCII, which defines only the values 0-127 and stores each character in one byte whose highest bit is always 0.

Later, many countries found that ASCII has too few characters; Chinese characters, for example, cannot be represented. So each country or region developed its own extended encoding, such as GB2312 in mainland China, Big5 in Taiwan, and Shift-JIS in Japan. These extended encodings all work the same way: they are variable-length encodings with a maximum length of 2 bytes, mainly in order to stay compatible with ASCII. The usual rule is that characters in ASCII keep an ASCII-compatible form, that is, a single byte whose highest bit is 0, while the country's own script uses two bytes per character. Two bytes give 2^16 = 65536 possible values, which is generally enough. The general practice is to set the highest bit of the first byte to 1. When the computer sees such a byte, it knows this is an extended encoding, reads the following byte as well, and processes the two bytes together as one character.

GB2312 is such a variable-length encoding with a maximum length of 2 bytes. It contains 7445 characters in total, including 6763 Chinese characters. Its encoding range is 2121H-777EH. As this shows, GB2312 does not use the whole two-byte space, so there is room to extend it.

GBK extends GB2312. Because the 6000-odd Chinese characters defined in GB2312 were not enough, GBK was created. It remains compatible with GB2312 and adds less common Chinese characters. GBK contains 21886 Chinese characters and symbols in total. GBK is also a variable-length encoding with a maximum length of 2 bytes, and its encoding range is 0x8140-0xFEFE.
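A quick way to see the compatibility is to encode the same text with Python's built-in `gb2312` and `gbk` codecs. The sample characters are chosen for this illustration only: "中" is in both sets, while the traditional form "國" is assumed here to be missing from GB2312 and present in GBK.

```python
common = "中"        # expected to be in both GB2312 and GBK
traditional = "國"   # traditional form; assumed absent from GB2312

# A character shared by both sets keeps the same two bytes in GBK.
same_bytes = common.encode("gb2312") == common.encode("gbk")

# A character outside GB2312 cannot be encoded with the gb2312 codec.
try:
    traditional.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

gbk_bytes = traditional.encode("gbk")
print(same_bytes, in_gb2312, gbk_bytes.hex())
```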

The extended encodings of other regions (Big5, Shift-JIS) follow the same idea as GB2312/GBK: a variable-length encoding of at most two bytes, which keeps them compatible with ASCII.

However, Chinese has more characters than these, including some especially rare ones. There are reportedly more than 70 thousand Chinese characters, which is more than two bytes can represent. This is why the GB18030 encoding was created.

GB18030 is a variable-length encoding with a maximum length of 4 bytes. It is backward compatible with GB2312 and GBK, and it adds many more characters, more than 70 thousand in total.
A GB18030 character may be 1, 2, or 4 bytes long. The part that is compatible with GB2312 and GBK uses two bytes; where two bytes are not enough, four bytes are used.
The encoding space of GB18030 is about 1.6 million code points, of which more than 20 thousand had been assigned before. The byte ranges are as follows. A one-byte character is in 0x00-0x7F and is compatible with ASCII. In a two-byte character, the first byte is in 0x81-0xFE and the second byte is in 0x40-0x7E or 0x80-0xFE, which is compatible with GBK. In a four-byte character, the first byte is in 0x81-0xFE, the second byte is in 0x30-0x39, and the third and fourth bytes have the same ranges as the first and second. The four-byte forms cover the Unicode 3.1 code points starting from 0x0080. In other words, the GB18030 code space corresponds exactly to the Unicode code space.
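The byte ranges above are enough to tell how long a GB18030 character is from its first two bytes. The sketch below is illustrative only: `gb18030_char_length` is a hypothetical helper, and it checks the lead bytes without validating the third and fourth bytes. Python's built-in `gb18030` codec is used to produce sample data.

```python
def gb18030_char_length(data: bytes, i: int = 0) -> int:
    """Return the length in bytes (1, 2 or 4) of the GB18030 character at data[i]."""
    first = data[i]
    if first <= 0x7F:                       # ASCII-compatible single byte
        return 1
    if 0x81 <= first <= 0xFE and i + 1 < len(data):
        second = data[i + 1]
        if 0x30 <= second <= 0x39:          # digit-range second byte: four-byte form
            return 4
        if 0x40 <= second <= 0x7E or 0x80 <= second <= 0xFE:
            return 2                        # GBK-compatible two-byte form
    raise ValueError("not a valid GB18030 lead sequence")


# Sample characters picked for illustration: ASCII, a common Chinese character,
# a BMP code point outside GBK (U+0080), and a code point outside the BMP.
samples = ["A", "中", "\u0080", "\U00020000"]
for ch in samples:
    encoded = ch.encode("gb18030")
    print(f"U+{ord(ch):04X}", encoded.hex(), gb18030_char_length(encoded))
```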

Although every country defined its own encoding, these encodings are not interoperable. Each one only has the 65536 positions of two bytes to work with, and no unified standard was formed: one country uses a segment for its characters and another country uses the same segment for different characters. As a result, text encoded in GB2312 turns into garbage when it is parsed as Shift-JIS. Unicode was created to solve this problem.
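The garbling is easy to reproduce with Python's built-in codecs. In this small illustration the sample text is arbitrary, and `errors="replace"` is used so that undecodable bytes do not raise an exception.

```python
text = "中文"
raw = text.encode("gb2312")          # bytes as a GB2312 system would store them

# Decoding the same bytes with the wrong table yields unrelated characters.
wrong = raw.decode("shift_jis", errors="replace")
right = raw.decode("gb2312")

print(raw.hex())
print("decoded as Shift-JIS:", wrong)
print("decoded as GB2312:  ", right)
```

The bytes never change; only the table used to interpret them does, which is why the receiver has to know the sender's encoding.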

Unicode defines a code space of more than one million code points, which covers the characters of every national encoding (GB2312, GBK, Big5, and so on). Representing every character in one uniform format takes 4 bytes; this is UTF-32, and it is the scheme Linux uses for wide characters (wchar_t is 4 bytes there). However, analysis shows that most characters can be expressed in 2 bytes, which saves space. Windows, for example, uses a two-byte Unicode scheme called UTF-16. For characters that do not fit in two bytes, UTF-16 uses surrogates: the first two bytes are marked as a surrogate, which signals that the following two bytes must be combined with them to form one character. Therefore, on Windows "Unicode" generally refers to UTF-16, while on Linux it generally refers to UTF-32.
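The surrogate mechanism can be worked through by hand. The sketch below uses the standard UTF-16 arithmetic (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves); the function name `to_surrogate_pair` and the sample code point are chosen for this example.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")
print("\U0001F600".encode("utf-16-be").hex())   # same four bytes from the built-in codec
```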

The name "Universal Multiple-Octet Coded Character Set", UCS for short, comes from the ISO/IEC 10646 standard; UCS stands for "Universal Character Set". Its code points are kept in step with Unicode, so in practice UCS refers to the character set of the Unicode standard. UCS specifies how to use multiple bytes to represent characters.

UCS has two formats: UCS-2 and UCS-4. UCS-2 encodes each character in two bytes, and UCS-4 encodes each character in four bytes (in fact only 31 bits are used; the highest bit must be 0).

Both UCS-2 and UCS-4 are fixed-length encodings. UCS-2 has 2^16 = 65536 code points, and UCS-4 has 2^31 = 2147483648 code points.

UCS-4 is divided into 2^7 = 128 groups according to the highest byte (whose highest bit is 0). Each group is divided into 256 planes according to the next byte, each plane is divided into 256 rows according to the third byte, and each row contains 256 cells. Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP. In other words, UCS-2 can only represent the BMP part of UCS-4. Converting between UCS-2 and UCS-4 within the BMP is simple: to go from UCS-2 to UCS-4, prepend two zero bytes; to go from UCS-4 to UCS-2, drop the two leading zero bytes. UCS-2 can only represent the characters defined in the BMP (code points <= 65535). The early UCS-4 specification did not assign any characters outside the BMP. (Newer versions of the standard do assign characters outside the BMP; otherwise the extended characters of GB18030 would not fit.)
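The BMP conversion rule can be written out directly. This sketch assumes big-endian byte order, so the "first two bytes" are the most significant ones; the helper names are hypothetical. For a BMP character, Python's `utf-16-be` and `utf-32-be` codecs produce the same bytes as UCS-2 and UCS-4, which makes them convenient for checking.

```python
def ucs2_to_ucs4(ucs2: bytes) -> bytes:
    """Widen one big-endian UCS-2 code unit to UCS-4 by prepending two zero bytes."""
    return b"\x00\x00" + ucs2


def ucs4_to_ucs2(ucs4: bytes) -> bytes:
    """Narrow one big-endian UCS-4 value to UCS-2; only valid inside the BMP."""
    if ucs4[:2] != b"\x00\x00":
        raise ValueError("code point is outside the BMP")
    return ucs4[2:]


ucs2 = "中".encode("utf-16-be")   # U+4E2D as two bytes
ucs4 = "中".encode("utf-32-be")   # U+4E2D as four bytes
print(ucs2.hex(), ucs4.hex())
print(ucs2_to_ucs4(ucs2) == ucs4, ucs4_to_ucs2(ucs4) == ucs2)
```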

Unicode is a character set, not a storage format. To use it in practice, the code points still have to be encoded as bytes. UCS-2 looks attractive, but it is not suitable for file systems: converting ASCII to UCS-2 just adds a 0x00 byte in front of each character, so the encoded text contains bytes such as '\0' and '/' that have a special meaning in UNIX file names and in some C functions, and this causes serious errors. This led to the transformation formats UTF-7, UTF-8, UTF-16, and UTF-32.
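The problem with UCS-2 in byte-oriented APIs can be seen by looking at the raw bytes. In the sketch below, the path and the sample character U+012F are arbitrary choices for illustration, and `utf-16-be` stands in for big-endian UCS-2, which is identical for BMP characters.

```python
path = "/tmp/a"                        # hypothetical ASCII path
as_ucs2 = path.encode("utf-16-be")     # every ASCII character gains a leading 0x00
as_utf8 = path.encode("utf-8")         # identical to the ASCII bytes

print(as_ucs2)
print(as_utf8)

# A non-ASCII character can also contain a byte that looks like '/'.
tricky = "\u012f"                      # its UCS-2 form is the bytes 01 2F
print(tricky.encode("utf-16-be"), tricky.encode("utf-8"))
```

A C function that stops at the first zero byte would see an empty string in the UCS-2 form, while the UTF-8 form passes through unchanged.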

UTF-8 is a variable-length encoding built from 8-bit units. The original design allowed 1 to 6 bytes per character; since RFC 3629 it is limited to 4 bytes, which is enough for every Unicode code point up to U+10FFFF. UTF-8 remains compatible with ASCII. In general, UTF-8 uses one byte for ASCII characters, two bytes for Western European characters, and three bytes for most Asian characters. UNIX platforms generally support UTF-8, and most HTML, file storage, and network transmission use UTF-8.
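The byte counts are easy to confirm with Python's built-in codec. The sample characters here are picked for illustration: an ASCII letter, an accented Western European letter, a Chinese character, and a character outside the BMP.

```python
samples = ["A", "é", "中", "\U0001F600"]
lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```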

UTF-16 is also a variable-length encoding, but it is not ASCII compatible. UTF-16 is a superset of UCS-2: it is UCS-2 plus support for supplementary characters, in line with the Unicode 4.0 specification. UTF-16 needs at least two bytes for a character and uses four bytes for the characters beyond UCS-2, so a UTF-16 character is either 2 bytes or 4 bytes long. UTF-16 is the main encoding scheme on the Windows platform, from Windows 2000 onward. On Windows, wchar_t is two bytes and holds UTF-16.

UTF-32 is a fixed-length encoding that is almost the same as UCS-4. It uses 4 bytes for every code point, and Linux uses UTF-32 for wide characters (wchar_t).

UTF is an encoding scheme, so byte order also becomes an issue. A byte order mark (BOM) appears at the beginning of a Unicode stream and indicates the encoding type. The BOM is a clever trick: UCS contains a character called "ZERO WIDTH NO-BREAK SPACE", encoded as FEFF, while FFFE is not a character in UCS, so it should never appear in real data. The recommendation is to send "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself, so that the receiver can use it to determine the byte order. The commonly used BOMs are:

> UTF-16 big endian      FE FF
> UTF-16 little endian   FF FE
> UTF-32 big endian      00 00 FE FF
> UTF-32 little endian   FF FE 00 00
> UTF-8                  EF BB BF

When we read one of these sequences at the start of a Unicode stream, we can determine the byte order. UTF-8 is encoded byte by byte, so it has no byte order problem, but a BOM can still be used to mark the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF, so a byte stream that starts with EF BB BF is UTF-8 encoded. On Windows, a file saved from Notepad in UTF-8 format starts with EF BB BF. On Linux, to stay compatible with ASCII, UTF-8 files normally do not contain the EF BB BF header.
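BOM sniffing along these lines can be sketched with the constants in Python's standard `codecs` module. The function name `detect_bom` is made up for this example. The UTF-32 marks are checked before the UTF-16 marks because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM; a UTF-16 little-endian stream whose first character is U+0000 would be misread, which is a known limit of this kind of guess.

```python
import codecs

# Longer marks first, so FF FE 00 00 is not mistaken for FF FE.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(data: bytes) -> tuple:
    """Return (encoding name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


stream = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, skip = detect_bom(stream)
print(encoding, stream[skip:].decode(encoding))
```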

One advantage of the UTF-8 design is that it encodes code points into a stream of bytes, rather than words or dwords, so the endianness of the underlying machine does not matter. This means that you can exchange UTF-8 streams between a little-endian machine and a big-endian machine without reordering bytes or adding a BOM. In other words, you can ignore the underlying architecture completely.

Another advantage of UTF-8 is that it stores the bits of the code point from most significant to least significant, so sorting strings by their raw bytes gives the same order as sorting by code point. This is not as good as sorting by locale collation rules, but it gives low-level systems that do not understand UTF-8 a simple way to sort: they only need to know how to sort raw bytes.
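The sorting property can be checked on a small sample. The word list is arbitrary and chosen for this illustration; Python compares `str` values by code point, so `sorted(words)` serves as the reference order. UTF-16 is included only as a contrast, because its surrogate range sits below some BMP characters.

```python
words = ["z", "é", "中", "\U0001F600", "a", "\uff21"]

by_code_point = sorted(words)                                      # reference order
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))     # raw byte order
by_utf16_bytes = sorted(words, key=lambda s: s.encode("utf-16-be"))

print(by_utf8_bytes == by_code_point)    # UTF-8 byte order matches code point order
print(by_utf16_bytes == by_code_point)   # UTF-16 byte order does not, for this sample
```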

Summary:

Unicode is a standard for a character set. The character set has two fixed-length formats, UCS-2 and UCS-4. Roughly speaking, UCS-4 is UTF-32, and UCS-2 is compatible with UTF-16, although UTF-16 extends it.

UTF is how Unicode is implemented as bytes. It comes in several forms, UTF-8, UTF-16, and UTF-32, of which UTF-8 and UTF-16 are variable length and UTF-32 is fixed length. (There are also UTF-7 and other encodings.)

Appendix: byte layout of UTF-8:

> Bytes  Payload bits  Layout
> 1      7             0bbbbbbb
> 2      11            110bbbbb 10bbbbbb
> 3      16            1110bbbb 10bbbbbb 10bbbbbb
> 4      21            11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
> 5      26            111110bb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb
> 6      31            1111110b 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb

The 5- and 6-byte forms belong to the original design only; RFC 3629 restricts UTF-8 to at most 4 bytes. The bytes 0xFE and 0xFF never appear in UTF-8, so there are no 7- or 8-byte forms.
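The layout in the table above can be turned into a small hand-written encoder. This is a minimal sketch covering only the 1- to 4-byte forms used by current UTF-8; the function name `utf8_encode` is hypothetical, and it does not reject surrogate code points the way a strict encoder would.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by filling the payload bits of the UTF-8 layout."""
    if cp < 0x80:                                   # 0bbbbbbb
        return bytes([cp])
    if cp < 0x800:                                  # 110bbbbb 10bbbbbb
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                                # 1110bbbb 10bbbbbb 10bbbbbb
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:                               # 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of Unicode range")


for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "), utf8_encode(cp) == chr(cp).encode("utf-8"))
```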
