Character Set and encoding

Last Update:2014-09-29 Source: Internet

Author: User

Tags: control characters, parse error

Abbreviations:

ASCII: American Standard Code for Information Interchange

UCS: Universal Character Set

UTF: Unicode/UCS Transformation Format

ASCII code

ASCII is a 7-bit code with an encoding range of 0x00-0x7F. The ASCII character set includes English letters, Arabic numerals, punctuation marks, and control characters. The ranges 0x00-0x1F and 0x7F hold the 33 control characters.

A system that supports only ASCII ignores the highest bit of each byte and treats the low 7 bits as the valid bits.
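The 7-bit rule above can be sketched in C as follows. This is a minimal illustration, and the function names `is_ascii` and `low7` are hypothetical helpers, not part of any library.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every byte has its highest bit clear (pure 7-bit ASCII). */
int is_ascii(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] & 0x80)      /* highest bit set: not ASCII */
            return 0;
    }
    return 1;
}

/* Mimics an ASCII-only system: drop the highest bit, keep the low 7 bits. */
unsigned char low7(unsigned char byte)
{
    return (unsigned char)(byte & 0x7F);
}
```

For instance, `low7(0x82)` yields 0x02, which shows why text that relies on the highest bit is damaged by a system that keeps only 7 bits.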

128 symbols are enough to encode English, but not other languages. For example, French letters that carry a diacritical mark cannot be represented in ASCII. As a result, some European countries decided to use the idle highest bit of the byte to encode new symbols. For example, é in French is encoded as 130 (binary 10000010). In this way, the encoding systems used by these European countries can represent a maximum of 256 symbols.

However, this created new problems. Different countries have different letters, so even though they all use 256-symbol encodings, the same value represents different letters. For example, 130 represents é in the French encoding, the letter gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, 0-127 represent the same symbols; only the 128-255 section differs.

Asian languages use far more characters; there are roughly 100,000 Chinese characters. A single byte can represent only 256 symbols, so multiple bytes are needed to represent one character. For example, the common encoding for simplified Chinese is GB2312, which uses two bytes per Chinese character and can therefore theoretically represent a maximum of 256 x 256 = 65,536 symbols.

Unicode Character Set

As mentioned in the previous section, multiple encoding methods exist in the world, and the same binary number can be interpreted as different symbols. To open a text file, therefore, you must know its encoding; decoding it with the wrong encoding produces garbled characters. Why do emails often contain garbled characters? Because the sender and the receiver use different encodings.

Imagine an encoding that includes every symbol in the world, with each symbol assigned a unique code. The garbled-text problem would disappear. This is Unicode: as its name suggests, it is an encoding for all symbols.

Unicode is, of course, a very large set; its code space has room for more than 1 million symbols, and each symbol has its own code. For example, U+0639 represents the Arabic letter ain, U+0041 represents the English capital letter A, and U+4E25 represents the Chinese character 严 (yan). You can look up specific symbols at unicode.org or in a dedicated Chinese character table.

Note that Unicode is only a symbol set. It specifies the binary code of each symbol, but not how that binary code should be stored.

For example, the Unicode code of the Chinese character 严 is hexadecimal 4E25, which is 15 bits in binary (100111000100101). In other words, representing this symbol requires at least two bytes. Symbols with larger codes may need 3 or 4 bytes, or even more.
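The bit count in this example can be checked with a short C sketch. The helper names `bit_length` and `bytes_needed` are hypothetical, and the byte count assumes a plain fixed-width binary form rather than any particular UTF.

```c
#include <assert.h>
#include <stdint.h>

/* Number of binary digits needed to write a code point (0 yields 0). */
int bit_length(uint32_t code_point)
{
    assert(code_point <= 0x10FFFF);   /* upper limit of the Unicode code space */
    int bits = 0;
    while (code_point != 0) {
        bits++;
        code_point >>= 1;
    }
    return bits;
}

/* Whole bytes needed to hold that many bits; at least one byte. */
int bytes_needed(uint32_t code_point)
{
    int bits = bit_length(code_point);
    return bits == 0 ? 1 : (bits + 7) / 8;
}
```

`bit_length(0x4E25)` is 15 and `bytes_needed(0x4E25)` is 2, matching the text.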

There are two serious problems here. First, how can Unicode be distinguished from ASCII? How does a computer know that three bytes represent one symbol rather than three separate symbols? Second, one byte is enough for an English letter. If Unicode uniformly required three or four bytes per symbol, the leading two or three bytes of every English letter would be 0, which is a huge waste of storage: an English text file would grow to three or four times its original size, which is unacceptable.

The result was: 1) multiple storage methods for Unicode appeared, that is, many different binary formats can be used to represent Unicode; 2) Unicode was not widely adopted for a long time, until the rise of the Internet.

UTF-8 Coding

With the spread of the Internet, a unified encoding became strongly needed. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (characters take two or four bytes) and UTF-32 (characters take four bytes), but they are rarely used on the Internet. To repeat: UTF-8 is one of the implementations of Unicode.

The biggest feature of UTF-8 is that it is a variable-length encoding. It uses one to four bytes per symbol, and the length depends on the symbol.

The UTF-8 encoding rules are simple; there are only two:

1) For a single-byte symbol, the first bit is set to 0 and the remaining seven bits hold the Unicode code of the symbol. For English letters, therefore, the UTF-8 encoding is identical to the ASCII code.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of each following byte are set to 10. All remaining bits hold the Unicode code of the symbol.

The following table summarizes the encoding rules. The letter x marks the bits available for the code.

Unicode symbol range | UTF-8 encoding method
(Hexadecimal)        | (Binary)
---------------------+---------------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

According to the table above, interpreting UTF-8 is straightforward. If the first bit of a byte is 0, the byte is a single character by itself. If the first bit is 1, the number of consecutive leading 1s gives the number of bytes occupied by the current character.
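The interpretation rule can be sketched in C by testing the leading bits of a byte. The names `utf8_sequence_length` and `utf8_count` are hypothetical, and the sketch only looks at lead bytes; it does not check that the following bytes really start with 10.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes occupied by the character that starts with this byte:
 * 1-4 for a lead byte, 0 for a continuation byte (10xxxxxx) or an invalid byte. */
int utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;
}

/* Counts characters by hopping from lead byte to lead byte.
 * Returns (size_t)-1 if a lead byte is invalid or the buffer is truncated. */
size_t utf8_count(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);
    size_t count = 0, i = 0;
    while (i < len) {
        int n = utf8_sequence_length(s[i]);
        if (n == 0 || (size_t)n > len - i)
            return (size_t)-1;
        i += (size_t)n;
        count++;
    }
    return count;
}
```

For the four bytes 41 E4 B8 A5, `utf8_count` reports two characters: one single-byte letter and one three-byte character.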

Next, take the Chinese character 严 as an example to demonstrate UTF-8 encoding.

The Unicode code of 严 is 4E25 (100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 requires three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary bit of 严, fill in the x positions of the format from back to front, and pad the remaining positions with 0. The resulting UTF-8 encoding of 严 is "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.

Little endian and big endian

These two odd names come from Jonathan Swift's Gulliver's Travels. In the book, a civil war breaks out in the land of the little people over whether an egg should be cracked open at its big end (Big-Endian) or at its little end (Little-Endian). Six rebellions were fought over the matter; one emperor lost his life and another lost his throne.

Accordingly, storing the most significant byte first is called big endian, and storing the least significant byte first is called little endian.

Naturally, a question arises: how does a computer know which byte order a file uses?

The Unicode specification defines a character that can be placed at the beginning of a file to indicate the byte order. This character is named "ZERO WIDTH NO-BREAK SPACE" and has the code FEFF. It is exactly two bytes long, and FF is 1 larger than FE.

If the first two bytes of a text file are FE FF, the file is big endian. If the first two bytes are FF FE, the file is little endian.

UTF-16 Coding

16-bit code units. A character occupies 2 bytes, or 4 bytes (a surrogate pair) for code points above U+FFFF;
A preferred encoding for Unicode, but incompatible with ASCII;
Depending on byte order, divided into UTF-16LE and UTF-16BE;

UTF-32 Coding

32-bit encoding. One character occupies 4 bytes;

Summary

For English-only documents, or mixed documents in which English characters are the majority, UTF-8 is more advantageous (it saves storage space);
For pure Chinese documents, or mixed documents in which Chinese characters are the majority, UTF-16 is more advantageous;
UTF-8 has to examine the marker bits at the start of every byte, but those same bits make it self-synchronizing: lead bytes and continuation bytes never look alike, so a corrupted byte damages only the character it belongs to and the decoder recovers at the next lead byte;
UTF-16 does not need to examine marker bits for most characters, so a corrupted byte likewise damages only one character, but a lost or inserted byte shifts the 2-byte alignment and garbles everything after it.

For example, the Unicode code of the Chinese character 汉 is 6C49 (decimal 27721):
UTF-16: 01101100 01001001 (2 bytes)
UTF-8: 11100110 10110001 10001001 (3 bytes)
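The storage trade-off described in this summary can be made concrete with a C sketch that returns the encoded size of a code point in each form. The function names are hypothetical; the ranges follow the UTF-8 table and the two-or-four-byte rule for UTF-16 given earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to store one code point in UTF-8. */
int utf8_size(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp <= 0x7F)   return 1;
    if (cp <= 0x7FF)  return 2;
    if (cp <= 0xFFFF) return 3;
    return 4;
}

/* Bytes needed to store one code point in UTF-16. */
int utf16_size(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return cp <= 0xFFFF ? 2 : 4;
}
```

An English letter such as U+0041 takes 1 byte in UTF-8 and 2 in UTF-16, while U+6C49 takes 3 bytes in UTF-8 and 2 in UTF-16, which is where the advice above comes from.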

The mark at the beginning of a text can be used to determine its encoding:
EF BB BF     UTF-8
FE FF        UTF-16/UCS-2, big endian
FF FE        UTF-16/UCS-2, little endian
FF FE 00 00  UTF-32/UCS-4, little endian
00 00 FE FF  UTF-32/UCS-4, big endian
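A byte order mark check can be sketched in C as below. It follows the Unicode convention that FE FF marks big endian and FF FE marks little endian. The function name `detect_bom` and the returned labels are hypothetical. The 4-byte marks are tested first because the UTF-32 little-endian mark begins with the same two bytes as the UTF-16 little-endian mark.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns a label for the byte order mark at the start of buf,
 * or NULL if no known mark is present. */
const char *detect_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE";
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE";
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8";
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)         return "UTF-16BE";
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)         return "UTF-16LE";
    return NULL;
}
```

One limitation of this sketch: a UTF-16 little-endian file whose first character is U+0000 starts with FF FE 00 00 and would be reported as UTF-32LE. A missing mark also proves nothing, since the mark is optional.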

Note: UTF-16 and UTF-32 data contain bytes with the value 0 that are not string terminators, so storing them in a char string makes string functions stop too early (using wchar_t avoids this problem).

UTF-8, UTF-16, and UTF-32 are different encoding methods for the Unicode character set;

Chinese also has another character set, the location code. The location code character set is unrelated to Unicode. Its main encoding methods are GB2312 and GBK, which are introduced below.

Location Code Character Set

The location code divides the code table into 94 areas, each with 94 positions. The area number and the position number of a character together form its location code. The location code is usually written as a decimal number: for example, 1601 means area 16, position 1, and the corresponding character is 啊 (a).

In the location code, areas 01-09 hold symbols and digits, areas 16-87 hold Chinese characters, and areas 10-15 and 88-94 are undefined blank areas. The Chinese characters are divided into two levels: the first level contains 3755 commonly used characters, placed in areas 16-55 and ordered by pinyin and then by stroke; the second level contains 3008 less common characters, placed in areas 56-87 and ordered by radical and then by stroke. Because the first-level characters are sorted by pinyin, each pinyin corresponds to a range of location codes within the first level. Many programs that look up the pinyin of a Chinese character are built on this principle.

The location code should be regarded as the definition of the character set: it defines which characters are included and where each one sits. GB2312 and GBK are the encodings used to represent it in a computer. The relationship between the location code and the GB2312 encoding is somewhat like that between the Unicode character set and the UTF-8 encoding.

GB2312 Encoding

GB2312 is a character encoding for simplified Chinese, designed on top of the location code. Adding 0xA0 to the area number and to the position number gives the two bytes of the GB2312 encoding.

In addition to common simplified Chinese characters, the GB2312 character set also contains Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters, but no traditional Chinese characters. You can therefore use traditional Chinese characters to test whether a system supports only GB2312.

GB2312 is a double-byte encoding with the range 0xA1A1-0xFEFE. With the undefined areas removed, the range actually used is 0xA1A1-0xF7FE.
EUC-CN can be understood as an alias of GB2312; the two are identical.

GBK Encoding

GBK stands for the Chinese Internal Code Extension Specification.

GBK is an extension of GB2312. Besides being compatible with GB2312, GBK can also represent traditional Chinese characters and Japanese kana.
GBK uses a double-byte representation. The overall range is 8140-FEFE: the first byte lies in 81-FE and the second byte in 40-FE, excluding the xx7F column.
This gives 23,940 code positions in total, of which 21,886 are assigned: 21,003 Chinese characters and 883 graphic symbols.
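The byte ranges above translate directly into a range check. The C sketch below uses hypothetical names and only tests whether a byte pair falls inside the GBK code space; it does not know which positions are actually assigned. The range holds 126 first bytes times 190 second bytes, which gives the 23,940 positions quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if (lead, trail) lies inside the GBK double-byte range. */
int is_gbk_pair(unsigned char lead, unsigned char trail)
{
    return lead >= 0x81 && lead <= 0xFE &&
           trail >= 0x40 && trail <= 0xFE && trail != 0x7F;
}

/* Length of the character starting at s: 1 for an ASCII byte,
 * 2 for a GBK pair, 0 if the bytes are out of range or truncated. */
int gbk_char_length(const unsigned char *s, size_t len)
{
    assert(s != NULL && len > 0);
    if (s[0] < 0x80)
        return 1;
    if (len >= 2 && is_gbk_pair(s[0], s[1]))
        return 2;
    return 0;
}
```

Because GBK is compatible with GB2312, a GB2312 pair such as B0 A1 also passes `is_gbk_pair`.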

Encoding conversion

#include <iconv.h>

iconv_t iconv_open(const char *tocode, const char *fromcode);
size_t iconv(iconv_t cd, char **inbuf, size_t *inbytesleft, char **outbuf, size_t *outbytesleft);
int iconv_close(iconv_t cd);

The iconv_open function declares the two encodings to convert between and returns a conversion handle for use by iconv and iconv_close.

Append the string "//TRANSLIT" to tocode: when a character in the source string cannot be represented in the target character set, it is approximated by one or more similar-looking characters;
Append the string "//IGNORE" to tocode: when a character in the source string cannot be represented in the target character set, it is discarded.

The iconv function converts the multi-byte sequence starting at *inbuf into a multi-byte sequence starting at *outbuf,
reading at most *inbytesleft bytes from *inbuf and writing at most *outbytesleft bytes to *outbuf.
iconv converts multiple characters per call. After each converted character, *inbuf is advanced by the number of bytes consumed and *inbytesleft is reduced by the same amount;
correspondingly, *outbuf is advanced by the number of bytes written and *outbytesleft is reduced by the same amount.

Conversion fails in the following three cases:
1. The input contains an invalid multi-byte sequence: errno is set to EILSEQ, (size_t)-1 is returned, and *inbuf points to the start of the invalid sequence;
2. The input ends with an incomplete multi-byte sequence: errno is set to EINVAL, (size_t)-1 is returned, and *inbuf points to the start of the incomplete sequence;
3. The output buffer has no room for the next character: errno is set to E2BIG and (size_t)-1 is returned.
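Putting the three functions together, a one-shot conversion might look like the C sketch below. The wrapper name `convert_encoding` is hypothetical, and the encoding names it accepts depend on the iconv implementation installed. It converts a whole buffer in a single call and treats any of the three failure cases as an error, rather than growing the output buffer and retrying on E2BIG.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Converts in[0..inlen) from encoding `from` to encoding `to`.
 * Returns the number of bytes written to out, or (size_t)-1 with errno set. */
size_t convert_encoding(const char *to, const char *from,
                        const char *in, size_t inlen,
                        char *out, size_t outcap)
{
    assert(to != NULL && from != NULL && in != NULL && out != NULL);

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;            /* unsupported encoding pair */

    char *inptr = (char *)in;         /* iconv takes char **, but only reads the input */
    char *outptr = out;
    size_t inleft = inlen, outleft = outcap;

    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outptr, &outleft);   /* flush any pending state */

    int saved_errno = errno;          /* iconv_close may overwrite errno */
    iconv_close(cd);
    if (rc == (size_t)-1) {
        errno = saved_errno;
        return (size_t)-1;
    }
    return outcap - outleft;
}
```

For example, converting the three UTF-8 bytes E4 B8 A5 to "UTF-16BE" produces the two bytes 4E 25. On systems where iconv is a separate library, linking with -liconv may be required.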

Wide character type

char: single-byte character, sizeof(char) = 1. char is an integer type; values within the ASCII (7-bit) range correspond to ASCII.
wchar_t: wide character. On Linux, sizeof(wchar_t) = 4. The standard does not specify an encoding for wchar_t; on Linux it is equivalent to UTF-32 in the machine's native byte order.

const wchar_t *p = L"abc";  // The L prefix tells the compiler this is a wide string; the UTF-8 source text is converted to UTF-32
printf("%ls", p);

Here p is encoded as 0x00000061 0x00000062 0x00000063.

wchar_t has its own set of read/write functions, such as:

#include <locale.h>
#include <wchar.h>

setlocale(LC_ALL, "zh_CN.UTF-8");
wchar_t a[10] = L"hello";
wprintf(L"this is a test!\n");
wprintf(L"%zu\n", wcslen(a));  // wcslen returns size_t, so use %zu
wprintf(L"%ls\n", a);

It is unwise to store UTF-8 encoded bytes directly in wchar_t, because a single wchar_t element could then hold a fraction of a character, or more than one character, instead of exactly one.
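One way to avoid this is to decode the UTF-8 bytes first, so that each array element holds exactly one code point. The C sketch below is a simplified decoder under stated assumptions: the name `utf8_to_code_points` is hypothetical, the output uses `uint32_t` rather than `wchar_t` so that the element width does not depend on the platform, and overlong forms and surrogate values are not rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes UTF-8 bytes into one code point per element of dst.
 * Returns the number of code points written, or (size_t)-1 if the
 * input is malformed or dst is too small. */
size_t utf8_to_code_points(const unsigned char *src, size_t len,
                           uint32_t *dst, size_t cap)
{
    assert(src != NULL && dst != NULL);
    size_t n = 0, i = 0;
    while (i < len) {
        unsigned char b = src[i];
        uint32_t cp;
        int extra;                                      /* continuation bytes expected */
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;                         /* stray continuation byte */

        if ((size_t)extra >= len - i || n == cap)
            return (size_t)-1;                          /* truncated input or dst full */
        for (int k = 1; k <= extra; k++) {
            unsigned char c = src[i + k];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;                      /* must look like 10xxxxxx */
            cp = (cp << 6) | (uint32_t)(c & 0x3F);
        }
        dst[n++] = cp;
        i += (size_t)extra + 1;
    }
    return n;
}
```

Decoding the four bytes 41 E4 B8 A5 yields two elements, 0x41 and 0x4E25. In a real program the standard function mbstowcs, called after setlocale selects a UTF-8 locale, serves the same purpose for wchar_t.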

This article is an English version of an article originally published in Chinese on aliyun.com.
