Character Encoding Basics

Source: Internet
Author: User

1. The concept of character set and character encoding

Character set: a table that assigns a numeric value to each character.

Character encoding: the rule for how a character is stored in memory, that is, how many bytes it takes and what each byte holds.
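As a minimal Python sketch of this distinction (the sample character is an arbitrary choice for illustration): `ord` returns the number that the character set assigns, while `str.encode` produces the bytes that one particular encoding stores.

```python
# Character set: the character is mapped to a number (its code point).
ch = "A"
code_point = ord(ch)

# Character encoding: the number is turned into concrete bytes.
utf8_bytes = ch.encode("utf-8")        # stored in 1 byte
utf16_bytes = ch.encode("utf-16-be")   # stored in 2 bytes

print(code_point, utf8_bytes.hex(), utf16_bytes.hex())  # 65 41 0041
```

The number stays the same in both cases; only the stored bytes differ.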

2. ASCII code (Basic code in English)

A byte has 8 bits, and each bit has two states, 0 and 1, so a byte can take 256 different states, from 00000000 to 11111111. Each state can correspond to one symbol, for 256 symbols in all.

In the 1960s, the United States developed a character encoding that standardized the relationship between English characters and bit patterns. It is known as ASCII and is still in use today.

ASCII defines 128 characters in total. For example, the space character is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 control characters, codes 0 to 31 and 127, that cannot be printed) use only the lower 7 bits of a byte; the first bit is always 0.
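The two values quoted above can be checked with a short Python sketch; the variable names are illustrative only.

```python
# Print the decimal and 8-bit binary form of the two characters from the text.
for ch in (" ", "A"):
    code = ord(ch)
    print(repr(ch), code, format(code, "08b"))

# Every one of the 128 ASCII codes is below 2**7, so the top bit is always 0.
ascii_codes = [chr(i).encode("ascii")[0] for i in range(128)]
top_bit_always_zero = all(code & 0x80 == 0 for code in ascii_codes)
print(top_bit_always_zero)  # True
```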

For English characters and numeric characters, the ASCII code is basically sufficient.

3. Local language encoding

For other languages, 128 symbols are far from enough, so many countries and regions defined their own encodings for their languages.

For example:

Simplified Chinese: GB2312, later extended to GBK. GB2312 (GBK) uses two bytes to represent a Chinese character, so in theory it can represent at most 256 x 256 = 65,536 symbols.

Traditional Chinese: BIG5, which also stores each Chinese character in two bytes.
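A small Python sketch shows why local encodings do not mix well. It assumes Python's built-in `gb2312` and `big5` codecs and uses the character 中 (U+4E2D) purely as a sample.

```python
ch = "\u4e2d"  # the Chinese character 中

# The same character gets different byte values in different local encodings.
simplified = ch.encode("gb2312")
traditional = ch.encode("big5")
print(simplified.hex(), traditional.hex())

# Both schemes store the character in two bytes ...
print(len(simplified), len(traditional))  # 2 2

# ... but bytes written with one scheme and read with the other come back
# as a different character (or a replacement mark), not as the original.
misread = simplified.decode("big5", errors="replace")
print(misread == ch)  # False
```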

4. Unicode

Unicode is a character set created to overcome the limitations of traditional character encoding schemes. It aims to cover the characters of every language in use and assigns each one a single, unique number (its code point), to meet the needs of cross-language, cross-platform text conversion and processing. For example, U+0639 is the Arabic letter ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严 ("strict"). The full correspondence table can be looked up at unicode.org or in a specialized Chinese character table. Unicode divides its code space into 17 planes, each with 2^16 = 65,536 code points.

Plane 0 is the BMP (Basic Multilingual Plane), which covers most of the characters in common use today. The other planes hold less common characters, such as ancient scripts, or are reserved for extension. The Unicode characters we normally use are usually in the BMP. A large part of the Unicode code space is still unassigned.

Unicode is only a set of symbols: it specifies the number of each symbol but not how that number is stored as bytes. Several encodings of Unicode exist; common ones are UTF-8, UCS-2, and UTF-16.
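Because each plane spans 2^16 code points, the plane of a character is simply its code point shifted right by 16 bits. The helper below is a hypothetical name introduced for this sketch; the emoji U+1F600 is an added sample from outside the BMP.

```python
def plane_of(ch: str) -> int:
    """Return the Unicode plane (0..16) that holds a single character."""
    return ord(ch) >> 16  # each plane spans 2**16 code points

samples = ["A", "\u0639", "\u4e25", "\U0001F600"]  # A, Arabic ain, 严, an emoji
for ch in samples:
    print(f"U+{ord(ch):04X} is in plane {plane_of(ch)}")

# 17 planes of 65,536 code points give the full range U+0000..U+10FFFF.
total_code_points = 17 * 2**16
print(hex(total_code_points - 1))  # 0x10ffff
```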

5. UCS-2

UCS-2 is a Unicode encoding that stores the code point of each character directly in two bytes. It was designed with only BMP characters in mind, hence the fixed 2-byte length, so it cannot represent Unicode characters in the other planes.

UCS-2 comes in two byte orders: big endian and little endian.

For example, the character 中 ("middle") has the Unicode code point 0x4E2D (01001110 00101101). It can be encoded as 01001110 00101101 (big endian: first byte 4E, second byte 2D) or as 00101101 01001110 (little endian: first byte 2D, second byte 4E).
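The two byte orders can be reproduced in Python with `int.to_bytes`. Python has no codec named UCS-2, so this sketch compares against the `utf-16-be` and `utf-16-le` codecs, which produce the same bytes for BMP characters.

```python
code_point = 0x4E2D  # the character 中

big = code_point.to_bytes(2, "big")        # most significant byte first
little = code_point.to_bytes(2, "little")  # least significant byte first
print(big.hex(), little.hex())  # 4e2d 2d4e

# For a BMP character the UTF-16 codecs give the same two bytes.
same_as_codec = (
    big == "\u4e2d".encode("utf-16-be")
    and little == "\u4e2d".encode("utf-16-le")
)
print(same_as_codec)  # True
```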

6. UTF-16

UTF-16 is also a Unicode encoding; it uses two or four bytes per character and can be seen as a superset of UCS-2. For BMP characters (U+0000 to U+D7FF and U+E000 to U+FFFF) it is identical to UCS-2. The range U+D800 to U+DFFF is permanently reserved and never mapped to characters; UTF-16 uses this reserved range to encode the code points of the supplementary planes (U+10000 to U+10FFFF). A supplementary-plane character is encoded in UTF-16 as a pair of 16-bit code units (32 bits, 4 bytes), called a surrogate pair. For the full design of surrogate pairs, consult other references.
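For orientation, the arithmetic behind a surrogate pair can be sketched as follows. The function name is hypothetical, and the sample code point U+1F600 is an added example rather than one from the text; the result is compared against Python's own `utf-16-be` codec.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into two 16-bit code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000     # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits go into D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into DC00..DFFF
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Cross-check against the built-in codec (big endian, no byte order mark).
expected = chr(0x1F600).encode("utf-16-be")
print(expected == high.to_bytes(2, "big") + low.to_bytes(2, "big"))  # True
```

Both halves fall inside the reserved range U+D800 to U+DFFF, which is why a decoder can never confuse them with ordinary BMP characters.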

7. UTF-8

UTF-8 is one of the most widely used Unicode encodings on the Internet.

The main feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, and the number of bytes depends on the symbol.

There are two coding rules for UTF-8:

1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the Unicode code point of the symbol. For English letters, therefore, the UTF-8 encoding is identical to ASCII.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0; the first two bits of each following byte are set to 10. All remaining bits hold the Unicode code point of the symbol.

The following table summarizes the encoding rules, and the letter x represents the bits that are available for encoding.

Unicode Symbol Range | UTF-8 Encoding method
(hex) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
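
The table translates almost line by line into code. The following is a minimal sketch of a hand-written encoder, with one branch per table row; the function name is hypothetical, surrogate code points are not rejected, and in practice Python's built-in `str.encode("utf-8")` does this work.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one code point by following the four rows of the table."""
    if code_point <= 0x7F:        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point outside the Unicode range")

print(encode_utf8(0x5F90).hex())  # e5be90
```

Each continuation byte carries 6 bits of the code point, which is why the shifts go down in steps of 6.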

With this table, interpreting UTF-8 is simple. If the first bit of a byte is 0, the byte is a complete character on its own. If the first bit is 1, the number of consecutive leading 1s gives the number of bytes that the current character occupies.
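The reading rule can be sketched as a function that inspects only the first byte of a sequence. The function name is hypothetical, and the sketch checks the bit pattern only; it does not validate the bytes that follow.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with lead_byte."""
    if (lead_byte & 0b1000_0000) == 0:
        return 1  # 0xxxxxxx
    if (lead_byte & 0b1110_0000) == 0b1100_0000:
        return 2  # 110xxxxx
    if (lead_byte & 0b1111_0000) == 0b1110_0000:
        return 3  # 1110xxxx
    if (lead_byte & 0b1111_1000) == 0b1111_0000:
        return 4  # 11110xxx
    raise ValueError("continuation byte (10xxxxxx) or invalid lead byte")

data = "A\u4e25".encode("utf-8")      # "A" followed by 严
print(utf8_sequence_length(data[0]))  # 1
print(utf8_sequence_length(data[1]))  # 3
```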

The Chinese character 徐 ("Xu") serves as an example of how UTF-8 encoding works.

The code point of 徐 is U+5F90 (binary 101111110010000). It falls in the third row of the table (0000 0800-0000 FFFF), so its UTF-8 encoding takes three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Starting from the last bit of the code point, fill in the x positions from right to left, and set any x positions left over to 0. The UTF-8 encoding of 徐 is therefore "11100101 10111110 10010000", which is E5 BE 90 in hexadecimal.

8. Encoding options in Windows Notepad

There are four options: ANSI, Unicode, Unicode big endian, and UTF-8.

1) ANSI is the default. For English files it is ASCII; for Simplified Chinese files it is GB2312 (only on the Simplified Chinese version of Windows; the Traditional Chinese version uses BIG5).

2) Unicode refers to UCS-2, which stores the code point of each character directly in two bytes. This option uses little endian byte order.

3) Unicode big endian encoding.

4) UTF-8 encoding.

After selecting an encoding, click the "Save" button; the file is converted to that encoding immediately.
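
To see how the four options differ at the byte level, the sketch below encodes the same short text with a Python codec for each one. The mapping from option names to codec names is an assumption made for illustration (in particular, `gbk` stands in for ANSI on a Simplified Chinese system), and the sketch ignores any byte order mark that Notepad may write at the start of a file.

```python
text = "A\u4e2d"  # "A" followed by 中

# Hypothetical mapping from Notepad option names to Python codec names.
notepad_options = {
    "ANSI": "gbk",                     # assumes a Simplified Chinese system
    "Unicode": "utf-16-le",
    "Unicode big endian": "utf-16-be",
    "UTF-8": "utf-8",
}

encoded = {name: text.encode(codec) for name, codec in notepad_options.items()}
for name, data in encoded.items():
    print(f"{name:20} {data.hex(' ')}")
```

The letter A stays a single byte in ANSI and UTF-8 but takes two bytes in both Unicode options, while 中 takes two bytes everywhere except UTF-8, where it takes three.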

Reference articles:

Character encoding notes: ASCII, Unicode, and UTF-8. http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html

All you need to know about character encoding. http://www.cnblogs.com/KevinYang/archive/2010/06/18/1760597.html
