ANSI, UTF-8, and Unicode Encoding

Source: Internet
Author: User

While recently writing a network data transmission program, I ran into a mess caused by the various character encodings. Here is a brief record of what I learned:

1. ASCII and ANSI Encoding

A character code is the internal code used to represent characters. A computer must use such an internal code whenever text is entered or stored. Internal codes are divided into:
A. Single-byte internal codes: single-byte character sets (SBCS), which can encode up to 256 characters.
B. Double-byte internal codes: double-byte character sets (DBCS), which can encode roughly 65,000 characters.
ASCII belongs to the former (it uses only 7 bits and defines 128 characters), and the latter corresponds to ANSI. On a Simplified Chinese operating system, ANSI refers to GB2312, code page 936 (ANSI maps to a different code page for each language).

2. gb2312 and GBK Encoding

GB2312 is the Simplified Chinese extension of ANSI. It contains about seven thousand characters in total. Because GB2312 covers too few Chinese characters and does not support Traditional Chinese, GBK extended it to support Traditional Chinese and more characters, for a total of about 22,000. GB18030 builds on GBK and adds the scripts of major ethnic minorities, such as Tibetan, Mongolian, and Uyghur.
A code page is a mapping table between a regional text encoding and Unicode. For example, the mapping table between GBK and Unicode is CP936, so "CP936" is also commonly used to refer to GBK.
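The relationship between GB2312, GBK, and CP936 can be checked with a minimal Python sketch. It assumes Python's built-in "gb2312", "gbk", and "cp936" codecs (Python treats "cp936" as an alias of "gbk"); the helper name can_encode and the sample strings are made up for illustration.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


simplified = "汉字"   # Simplified Chinese sample (illustrative)
traditional = "漢字"  # Traditional Chinese sample (illustrative)

# For characters that GB2312 covers, GBK/CP936 yields the same bytes.
same_bytes = simplified.encode("gb2312") == simplified.encode("cp936")

# A Traditional character is missing from GB2312 but present in GBK.
in_gb2312 = can_encode(traditional, "gb2312")
in_gbk = can_encode(traditional, "gbk")
```

Here same_bytes and in_gbk should come out true while in_gb2312 comes out false, which matches the description of GBK as a superset of GB2312.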

3. Unicode

ANSI has many code pages, and text stored in one code page cannot be displayed correctly under another. Because the different encodings used in different countries made communication and data transmission inconvenient, ISO set out to replace all regional encoding schemes with a single global scheme that encodes every letter and symbol uniformly. It is called the "Universal Multiple-Octet Coded Character Set", UCS for short (ISO 10646). At the same time, the Unicode Consortium (unicode.org) developed its own global encoding, Unicode. Since Unicode 2.0, Unicode has used the same character repertoire and code points as UCS. At the time of writing, the 16-bit form, UCS-2/Unicode, is the one mainly in use.

4. UTF Encoding

UTF (Unicode/UCS Transformation Format) is a variable-length way of storing UCS code points, designed mainly to solve the problem of transmitting UCS-encoded text. It comes in several forms: UTF-7, UTF-8, UTF-16, UTF-32, and so on. UTF-8 is a UTF encoding whose unit is 8 bits (one byte); a single character is encoded as 1 to 6 bytes in the original scheme (the current standard, RFC 3629, stops at U+10FFFF, so at most 4 bytes are used in practice). The conversion from Unicode/UCS is as follows:

Unicode/UCS range (hex)      UTF-8 byte sequence (binary)
U+00000000 - U+0000007F:     0xxxxxxx
U+00000080 - U+000007FF:     110xxxxx 10xxxxxx
U+00000800 - U+0000FFFF:     1110xxxx 10xxxxxx 10xxxxxx
U+00010000 - U+001FFFFF:     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U+00200000 - U+03FFFFFF:     111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U+04000000 - U+7FFFFFFF:     1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

For example, the Unicode/UCS code point of "我" ("I") is U+6211 (01100010 00010001), which falls in the range U+00000800 - U+0000FFFF, so it is encoded in three bytes. Split the bits according to the rule into 0110 001000 010001, then fill them into the x positions of the template to get 11100110 10001000 10010001, that is, "E6 88 91". This is the UTF-8 encoding of "我".
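The bit-filling rule in the table can be sketched in a few lines of Python. This is an illustrative implementation of the original 1-to-6-byte scheme described above, not a replacement for the built-in codec; the function name utf8_encode is hypothetical, and unlike a modern encoder it does not reject surrogates or values above U+10FFFF.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point using the 1-to-6-byte table above."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return bytes([code_point])  # 0xxxxxxx

    # (upper bound of range, lead-byte prefix, number of 10xxxxxx bytes)
    ranges = [
        (0x7FF, 0xC0, 1),       # 110xxxxx
        (0xFFFF, 0xE0, 2),      # 1110xxxx
        (0x1FFFFF, 0xF0, 3),    # 11110xxx
        (0x3FFFFFF, 0xF8, 4),   # 111110xx
        (0x7FFFFFFF, 0xFC, 5),  # 1111110x
    ]
    for upper, prefix, n_cont in ranges:
        if code_point <= upper:
            # The lead byte carries the highest bits of the code point.
            lead = prefix | (code_point >> (6 * n_cont))
            # Each continuation byte carries the next 6 bits, high to low.
            tail = [
                0x80 | ((code_point >> (6 * i)) & 0x3F)
                for i in range(n_cont - 1, -1, -1)
            ]
            return bytes([lead] + tail)
    raise ValueError("code point is outside the 31-bit UCS range")


encoded = utf8_encode(0x6211)  # the three bytes E6 88 91
```

For code points that modern UTF-8 allows, the result should agree with Python's own chr(code_point).encode("utf-8").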
An interesting example:
Create a text file in Windows Notepad, type the word "联通" (Unicom), save it, close it, and open it again. You will find that the text is no longer "联通" but a few garbled characters.
When you create a file in Notepad, the default encoding is ANSI, so Chinese input is stored in the GB series encoding. The encoding of the word "联通" is:
C1 1100 0001
AA 1010 1010
CD 1100 1101
A8 1010 1000
Have you noticed? The first and third bytes begin with "110", and the second and fourth bytes begin with "10", which exactly matches the two-byte template in the UTF-8 rules. So when you reopen the file, Notepad mistakenly concludes that it is UTF-8 encoded. Strip the "110" from the first byte and the "10" from the second, and you get "00001 101010". Join the bits and pad with leading zeros to get "0000 0000 0110 1010", which is Unicode U+006A, the lowercase letter "j". Decoding the next two bytes as UTF-8 in the same way gives U+0368, which does not display as a meaningful character on its own. This is why a file containing only the word "联通" cannot be displayed correctly in Notepad.
If you type a few more words after "联通", their bytes will not necessarily start with 110 and 10, so Notepad will no longer insist that the file is UTF-8. It interprets the file as ANSI, and the garbled text does not appear.
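The misreading can be reproduced by hand. The sketch below assumes the word in question is "联通", whose GBK bytes are the C1 AA CD A8 listed above, and uses a hypothetical helper, decode_two_byte_template, that applies the two-byte template without any validity checks. Python's own UTF-8 decoder is stricter and rejects these bytes, so the lenient decoding here only models the behavior described for Notepad.

```python
def decode_two_byte_template(first: int, second: int) -> int:
    """Strip the 110/10 prefixes and join the remaining 5 + 6 bits."""
    if first >> 5 != 0b110 or second >> 6 != 0b10:
        raise ValueError("bytes do not match 110xxxxx 10xxxxxx")
    return ((first & 0x1F) << 6) | (second & 0x3F)


gb_bytes = "联通".encode("gbk")  # C1 AA CD A8

# Read the four bytes as two UTF-8 two-byte sequences.
code_points = [
    decode_two_byte_template(gb_bytes[i], gb_bytes[i + 1])
    for i in (0, 2)
]

# A strict decoder refuses the same bytes instead of guessing.
try:
    gb_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```

The two code points come out as 0x6A and 0x368, matching the walk-through above, while strict_ok ends up false.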

5. UTF-16

UTF-16 is a UTF encoding whose unit is two bytes (16 bits). Because Unicode/UCS is currently used mainly in its 16-bit form, UTF-16 stores characters the same way as the Unicode/UCS encoding. To be exact, it is the same as UCS-2/Unicode 16-bit for characters in the range U+0000 - U+FFFF; characters beyond that range are stored as a pair of 16-bit units (a surrogate pair).

6. Big endian and little endian

These two options often appear alongside UTF-16 or UCS encodings. Big endian and little endian are the two ways a CPU can order the bytes of a multi-byte number. For example, the Unicode/UCS code point of the character "汉" is 6C49. When writing it to a file, should 6C come first, or 49? Writing 6C first is big endian; writing 49 first is little endian.
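A short Python sketch makes the two byte orders concrete. It relies only on the built-in "utf-16-be" and "utf-16-le" codecs and on int.to_bytes; the variable names are illustrative.

```python
char = "汉"
code_point = ord(char)  # 0x6C49

# Big endian: the high byte 6C is written first.
big = char.encode("utf-16-be")

# Little endian: the low byte 49 is written first.
little = char.encode("utf-16-le")

# The same ordering applies to any two-byte integer.
big_from_int = code_point.to_bytes(2, "big")
little_from_int = code_point.to_bytes(2, "little")
```

Both approaches should produce 6C 49 for big endian and 49 6C for little endian.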
BOM stands for "byte order mark". UTF-8 uses the byte as its encoding unit, so it has no byte order problem. UTF-16 uses two bytes as its encoding unit, so before interpreting UTF-16 text you must first know the byte order of each unit. For example, the Unicode/UCS code point of "奎" is 594E and that of "乙" is 4E59. If we receive the UTF-16 byte stream "59 4E", is it "奎" or "乙"?
Unicode/UCS includes a character named "ZERO WIDTH NO-BREAK SPACE", whose code point is FEFF. FFFE is not a character in Unicode/UCS, so it should never appear in real data. The UCS specification recommends transmitting "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself. If the recipient reads FE FF, the byte stream is big endian; if it reads FF FE, the byte stream is little endian. For this reason the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.
UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF, so a receiver that sees a byte stream starting with EF BB BF knows that it is UTF-8. Windows uses the BOM to mark the encoding of text files.
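BOM-based detection might be sketched as follows. The constants come from Python's standard codecs module; the function names sniff_bom and decode_with_bom are hypothetical, the fallback to UTF-8 when no BOM is present is an assumption of this sketch, and UTF-32 BOMs are deliberately ignored to keep it short.

```python
import codecs


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le", 2
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode data, using the BOM if present and `default` otherwise."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or default)


# The same payload bytes 59 4E mean different characters per byte order.
as_big = decode_with_bom(b"\xfe\xff\x59\x4e")
as_little = decode_with_bom(b"\xff\xfe\x59\x4e")
```

With the BOM in front, the ambiguous bytes 59 4E resolve to the character at U+594E in the big-endian case and to the one at U+4E59 in the little-endian case.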
