Unicode Character Set and Its Encoding Method

Last Update:2014-09-02 Source: Internet

Author: User


Before the main content, let's first understand a basic concept: the coded character set.

Coded character set: a coded character set is a character set that assigns a unique number to each character. The core of the Unicode standard is a coded character set. For example, the letter "A" is assigned the number 0041 (hexadecimal) and the euro sign "€" is assigned 20AC (hexadecimal). The Unicode standard always writes these numbers in hexadecimal with the prefix "U+", so the code for "A" is written as "U+0041".
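
As a small illustration of the U+ notation, the following Python sketch formats the number assigned to a character. The helper name u_notation is made up for this example and is not part of any standard.

```python
def u_notation(ch: str) -> str:
    """Format the code point of a single character in the U+XXXX style."""
    # ord() returns the number Unicode assigns to the character;
    # 04X pads it to at least four uppercase hexadecimal digits.
    return f"U+{ord(ch):04X}"


print(u_notation("A"))  # U+0041
print(u_notation("€"))  # U+20AC
```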

1 ASCII code

We know that all information in a computer is ultimately represented as a string of binary digits. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states; this group of eight bits is called a byte. In other words, a single byte can represent 256 different states, from 00000000 to 11111111, and each state can correspond to one symbol, giving 256 symbols.

In the 1960s, the United States developed a character code that defines the relationship between English characters and binary values. This is called ASCII, and it is still in use today.

ASCII specifies the encoding of 128 characters in total (strictly speaking, ASCII is a coded character set). For example, the space is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including the non-printing control characters, codes 0 to 31 and 127) occupy only the low seven bits of a byte; the highest bit is always 0. The remaining 128 values (128 to 255) are known as extended ASCII. Many x86-based systems support extended ASCII codes.

The upper 128 values of a byte (128 to 255), which lie beyond the 128 ASCII codes, can be customized to represent special characters and non-English characters. GB2312 uses this upper range to represent Chinese characters: the 94 byte values in [161, 254] are used for both bytes of a two-byte code, forming a 94 x 94 double-byte table of simplified Chinese characters.
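
To make the byte ranges concrete, here is a short Python sketch using the built-in "ascii" and "gb2312" codecs. The sample character "中" is chosen only for illustration.

```python
# ASCII: one byte per character, highest bit always 0.
ascii_bytes = "A".encode("ascii")
print(ascii_bytes[0], format(ascii_bytes[0], "08b"))  # 65 01000001

# GB2312: a Chinese character takes two bytes, each in the range 161..254.
gb_bytes = "中".encode("gb2312")
print([b for b in gb_bytes])  # [214, 208]
print(all(161 <= b <= 254 for b in gb_bytes))  # True
```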

2 Unicode Character Set

The ASCII character set is enough for English, but once the characters of the world's other languages are counted, ASCII is clearly not enough. The Unicode character set was created to solve this.

Unicode maps characters to numbers in the range 0 to 0x10FFFF, so it can hold at most 1114112 characters. Each of these numbers is called a code point: a number that can be assigned to a character. UTF-8, UTF-16, and UTF-32 are encoding schemes that convert these numbers into program data.

3 UTF-8

http://zh.wikipedia.org/wiki/UTF-8

The Unicode character set only defines the characters and their corresponding code point values. How do we store and read these values in a program? The obvious approach is to store every code point in four bytes. But then the part of Unicode that corresponds to the ASCII table, where one byte is enough for each code point, wastes three bytes per character. This makes UTF-8 a very good choice.

The main feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, and the number of bytes depends on the symbol.

Unicode symbol range  | UTF-8 encoding
(hexadecimal)         | (binary)
----------------------+---------------------------------------------
0000 0000 ~ 0000 007F | 0xxxxxxx                              7 bits
0000 0080 ~ 0000 07FF | 110xxxxx 10xxxxxx                    11 bits
0000 0800 ~ 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx           16 bits
0001 0000 ~ 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  21 bits

UTF-8 encoding rules:

1> If the code point value needs 7 binary digits or fewer, it is represented with a single byte. The first bit of the byte is set to 0 and the remaining 7 bits hold the code point of the symbol. For English letters, therefore, the UTF-8 encoding is identical to ASCII.

2> If the code point value needs between 8 and 11 binary digits, two bytes are used: the first two bits of the first byte are set to 1, the third bit is set to 0, and the first two bits of the second byte are set to 10. All remaining bits are filled with the binary digits of the symbol's code point.

3> In general, for a symbol that requires n UTF-8 bytes (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of each following byte are set to 10. All remaining bits are filled with the binary digits of the symbol's code point.

A UTF-8 encoding is at most four bytes long, so it can hold a code point of up to 21 bits.
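
The rules above can be sketched in Python as follows. This is a minimal illustration, not a production encoder; the function name utf8_encode is an assumption made for this example, and in practice the built-in str.encode("utf-8") does the same job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using the bit templates above."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        # This range is reserved for UTF-16 and has no UTF-8 form.
        raise ValueError("reserved code point")
    if cp <= 0x7F:        # up to 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x41).hex())    # 41
print(utf8_encode(0x20AC).hex())  # e282ac
```

Comparing the result with chr(cp).encode("utf-8") for a few code points is a quick way to check the sketch.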

4. UTF-16

http://zh.wikipedia.org/wiki/UTF-16

Hexadecimal code range	UTF-16 representation (binary)	Decimal code range	Number of bytes
U+0000 - U+FFFF	xxxxxxxx xxxxxxxx	0-65535	2
U+10000 - U+10FFFF	110110yyyyyyyyyy 110111xxxxxxxxxx	65536-1114111	4

The advantage of UTF-16 over UTF-8 is that most characters are stored in a fixed length of 2 bytes: plane 0, which contains all the commonly used characters, falls within this range. However, UTF-16 is not compatible with ASCII.

UTF-16 is expressed in 16-bit unsigned integers. Writing a Unicode code point as U, the encoding rules are as follows:

If U < 0x10000, the UTF-16 encoding of U is simply U as a 16-bit unsigned integer (for brevity, a 16-bit unsigned integer is called a word below).

If U ≥ 0x10000, first compute U' = U - 0x10000 and write U' as 20 binary digits: yyyy yyyy yyxx xxxx xxxx. The UTF-16 encoding of U (in binary) is then: 110110yyyyyyyyyy 110111xxxxxxxxxx.

Why can U' always be written in 20 binary digits? The largest Unicode code point is 0x10FFFF. After subtracting 0x10000, the largest value of U' is 0xFFFFF, which fits in 20 binary digits. For example, take the code point 0x20C30. Subtracting 0x10000 gives 0x10C30, which in binary is 0001 0000 1100 0011 0000. Replace the y's in the template with the first 10 digits and the x's with the last 10 digits. The result is 1101100001000011 1101110000110000, that is, 0xD843 0xDC30.
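
The two rules can be sketched in Python as below. The function name utf16_encode is an assumption for this example; it returns the 16-bit words as plain integers and leaves byte order aside.

```python
from typing import List


def utf16_encode(u: int) -> List[int]:
    """Return the UTF-16 words (16-bit integers) for one code point."""
    if not 0 <= u <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if u < 0x10000:
        return [u]                    # one word: the value itself
    u_prime = u - 0x10000             # at most 20 bits
    high = 0xD800 | (u_prime >> 10)   # 110110 followed by the first 10 bits
    low = 0xDC00 | (u_prime & 0x3FF)  # 110111 followed by the last 10 bits
    return [high, low]


print([hex(w) for w in utf16_encode(0x20C30)])  # ['0xd843', '0xdc30']
```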

By these rules, the UTF-16 encoding of a code point in 0x10000-0x10FFFF consists of two words: the high 6 bits of the first word are 110110 and the high 6 bits of the second word are 110111. The first word therefore ranges (in binary) from 11011000 00000000 to 11011011 11111111, that is, 0xD800-0xDBFF. The second word ranges from 11011100 00000000 to 11011111 11111111, that is, 0xDC00-0xDFFF.

To distinguish one-word UTF-16 encodings from two-word ones, the designers of Unicode reserved 0xD800-0xDFFF, known as the surrogate area:

D800-DB7F  High Surrogates

DB80-DBFF  High Private Use Surrogates

DC00-DFFF  Low Surrogates

A high surrogate is a code point in this range that serves as the first word of a two-word UTF-16 encoding. A low surrogate is a code point that serves as the second word. So what is a high private use surrogate? Let's answer this question and, along the way, see how a Unicode code point is recovered from its UTF-16 encoding.

Suppose the first word of a character's UTF-16 encoding lies between 0xDB80 and 0xDBFF. In what range does its code point fall? The second word ranges over 0xDC00-0xDFFF, so the UTF-16 encoding of the character ranges from 0xDB80 0xDC00 to 0xDBFF 0xDFFF. Written in binary, this range is:

1101101110000000 1101110000000000 - 1101101111111111 1101111111111111

Take the last 10 bits of the high word and of the low word and put them together:

1110 0000 0000 0000 0000 - 1111 1111 1111 1111 1111

That is, 0xE0000-0xFFFFF. Reversing the encoding step by adding 0x10000 gives 0xF0000-0x10FFFF. This is the range of code points whose first UTF-16 word lies between 0xDB80 and 0xDBFF, namely plane 15 and plane 16. Because the Unicode standard designates both plane 15 and plane 16 as private use areas, the reserved code points 0xDB80-0xDBFF are called high private use surrogates.
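
The derivation above is the decoding direction of UTF-16. A minimal Python sketch, with the hypothetical helper name utf16_decode_pair, reproduces the two endpoints of the range:

```python
def utf16_decode_pair(high: int, low: int) -> int:
    """Recover a code point from a high word and a low word."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first word is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second word is not a low surrogate")
    # Join the last 10 bits of each word, then add back 0x10000.
    return (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000


print(hex(utf16_decode_pair(0xDB80, 0xDC00)))  # 0xf0000
print(hex(utf16_decode_pair(0xDBFF, 0xDFFF)))  # 0x10ffff
```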

5. Analysis of standard unicode encoding tables

Unicode character plane mapping:

http://zh.wikipedia.org/wiki/Unicode%E5%AD%97%E7%AC%A6%E5%B9%B3%E9%9D%A2%E6%98%A0%E5%B0%84

The full Unicode code table can be seen at: http://zh.wikibooks.org/wiki/Unicode

Unicode code points are currently divided into 17 groups, each called a plane, and each plane has 65536 (2^16) code points. However, only a few planes are in use so far.
The code table linked above lists only the planes that are in use.

U+	0	1	2	3	4	5	6	7	8	9	A	B	C	D	E	F
0000	NUL	SOH	STX	ETX	EOT	ENQ	ACK	BEL	BS	HT	LF	VT	FF	CR	SO	SI
0010	DLE	DC1	DC2	DC3	DC4	NAK	SYN	ETB	CAN	EM	SUB	ESC	FS	GS	RS	US
0020	SP	!	"	#	$	%	&	'	(	)	*	+	,	-	.	/
0030	0	1	2	3	4	5	6	7	8	9	:	;	<	=	>	?
0040	@	A	B	C	D	E	F	G	H	I	J	K	L	M	N	O
0050	P	Q	R	S	T	U	V	W	X	Y	Z	[	\	]	^	_

The row label and the column label together determine a unique Unicode value (in hexadecimal). For example:
The ESC character is in row 0010 and column B, so its code point is 0010 + B = 001B.
This is how the table is read.
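
Reading the table amounts to adding the row label and the column label. A minimal Python sketch, with the helper name table_lookup invented for this illustration:

```python
def table_lookup(row: int, column: int) -> int:
    """Combine a row label (such as 0x0010) and a column label (0x0 to 0xF)."""
    if not 0 <= column <= 0xF:
        raise ValueError("column must be a single hexadecimal digit")
    return row + column


esc = table_lookup(0x0010, 0xB)
print(f"U+{esc:04X}")                  # U+001B
print(chr(table_lookup(0x0040, 0x1)))  # A
```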

6. Byte order in UTF-8 and UTF-16

http://zh.wikipedia.org/wiki/UTF-8

http://zh.wikipedia.org/wiki/UTF-16

I searched online, and the explanations of byte order for the two encodings are quite superficial. A typical explanation reads:
UTF-8 uses the byte as its code unit, so it has no byte order issue. UTF-16 uses two bytes as its code unit, so it does have a byte order issue.
So why does UTF-8 have no byte order issue, when a UTF-8 encoding can be three or four bytes long? I think many people, myself included, would ask this. Let's recall what byte order is. A single-byte variable has no byte order issue; a multi-byte variable does.

Byte order matters whenever numbers are passed between CPUs, for example over a network, through files, or across a bus. For network transmission between platforms with different byte orders, data is converted to network byte order (big-endian) before sending, and the receiver converts from network byte order to its local byte order according to its own CPU. The receiver therefore reads the same values that the sender wrote, so at the network level the byte order problem is already solved; what remains at the Unicode encoding level is the same problem as for file storage, discussed below. The bus case covers data exchange between multiple CPUs with different byte orders on the same platform; we will rarely encounter it, so it is not discussed here. That leaves files: a file written on one platform is opened on another platform with a different byte order. How should byte order be handled in this case?

Let's go back and analyze the statement "UTF-8 uses the byte as its code unit, so it has no byte order issue. UTF-16 uses two bytes as its code unit, so it does have a byte order issue." What is a code unit? Unicode is only a collection of symbols. The UTF-8 or UTF-16 rules give the unique binary code of a particular symbol, but the rules for computing that code do not say how it is stored. UTF-8 specifies that the binary code is read and stored one byte at a time; UTF-16 specifies that it is read and stored two bytes at a time. The code unit, then, is the smallest unit in which the binary code is read and stored under a given encoding. It is important to be clear about this.

Whether UTF-8 or UTF-16 is used, the final binary code may be two bytes or longer. UTF-8 uses the byte as its code unit, so the code is stored byte by byte in encoding order (the binary value from left to right). When the file is read one byte at a time in storage order, the same binary code is recovered, so byte order need not be considered. UTF-16 uses two bytes as its code unit, and its binary code is either two or four bytes long. When storing, two bytes at a time are written to the file in encoding order (left to right) until everything is saved, so the order of one two-byte unit relative to the next matches the encoding order. But in what order are the two bytes inside a code unit stored? UTF-16 does not specify this: you may store the two bytes big-endian or little-endian as you prefer. So when someone else reads a file you encoded in UTF-16, should they read each code unit as big-endian or little-endian?

If that is a little confusing, here is another example:
Suppose we want to convert U+64321 (hexadecimal) into UTF-16. Because it is greater than U+FFFF, it must be encoded in 32 bits (4 bytes), as follows:
V  = 0x64321
Vx = V - 0x10000
   = 0x54321
   = 0101 0100 0011 0010 0001

Vh = 01 0101 0000 // the high 10 bits of Vx
Vl = 11 0010 0001 // the low 10 bits of Vx
W1 = 0xD800 // initial value of the first 16 bits of the result
W2 = 0xDC00 // initial value of the last 16 bits of the result

W1 = W1 | Vh
   = 1101 1000 0000 0000
   |        01 0101 0000
   = 1101 1001 0101 0000
   = 0xD950

W2 = W2 | Vl
   = 1101 1100 0000 0000
   |        11 0010 0001
   = 1101 1111 0010 0001
   = 0xDF21

So the correct UTF-16 binary code for U+64321 is:
1101 1001 0101 0000 1101 1111 0010 0001
The first 16 bits are the high word and the last 16 bits are the low word.
Since the code unit of UTF-16 is two bytes, storage follows the encoding order: first the high two bytes 1101 1001 0101 0000, then the low two bytes 1101 1111 0010 0001.
But in what order are the two bytes within each unit stored? Storage always fills the byte at the low address first, then the byte at the high address.

In little-endian order (the least significant byte at the lowest address, the most significant byte at the highest address), the low address stores the low byte. So for the high word, the low byte 50 is stored first, followed by the high byte D9, giving 50 D9 in hexadecimal. Likewise, the low word is stored as 21 DF. The stored content is 50 D9 21 DF.
In big-endian order (the most significant byte at the lowest address, the least significant byte at the highest address), the low address stores the high byte. So for the high word, the high byte D9 is stored first, followed by the low byte 50, giving D9 50.
Likewise, the low word is stored as DF 21. Joined together, the stored content is D9 50 DF 21.
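
The two storage orders for U+64321 can be reproduced with Python's struct module, where "<" selects little-endian and ">" selects big-endian packing of 16-bit words. This is only a sketch of the example above; the variable names are chosen for illustration.

```python
import struct

words = (0xD950, 0xDF21)  # the two UTF-16 words computed for U+64321

little = struct.pack("<2H", *words)  # low byte of each word first
big = struct.pack(">2H", *words)     # high byte of each word first

print(little.hex())  # 50d921df
print(big.hex())     # d950df21

# Python's own codecs produce the same bytes.
print(little == chr(0x64321).encode("utf-16-le"))  # True
print(big == chr(0x64321).encode("utf-16-be"))     # True
```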

7. How to solve the byte order problem in UTF-8 and UTF-16
After the analysis above, the byte order problem should be clear. So how can other programs know the byte order when they read the files you write?
The Unicode specification suggests placing a character at the beginning of the file to indicate the byte order. This character is named "ZERO WIDTH NO-BREAK SPACE" and its code point is U+FEFF. For UTF-16, if you write in little-endian order, this character is stored in the file as FF FE; in big-endian order, it is stored as FE FF.
UTF-8 is independent of byte order and does not need this mark. However, a BOM (byte order mark) can still be used to indicate the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF; because UTF-8 is stored with the byte as the code unit, the bytes in the file are the same as the encoding. So if a file starts with EF BB BF, we know it is encoded in UTF-8.
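
The BOM bytes can be checked directly in Python by encoding U+FEFF; the constants in the standard codecs module hold the same byte sequences. This is a small illustrative sketch, and the variable names are not from the original text.

```python
import codecs

bom = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, used as the byte order mark

bom_le = bom.encode("utf-16-le")
bom_be = bom.encode("utf-16-be")
bom_u8 = bom.encode("utf-8")

print(bom_le.hex())  # fffe
print(bom_be.hex())  # feff
print(bom_u8.hex())  # efbbbf

# The codecs module exposes the same sequences as constants.
print(bom_le == codecs.BOM_UTF16_LE, bom_be == codecs.BOM_UTF16_BE, bom_u8 == codecs.BOM_UTF8)

# The "utf-8-sig" codec writes the UTF-8 BOM in front of the text.
print("hi".encode("utf-8-sig"))  # b'\xef\xbb\xbfhi'
```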
8. Conversion between encoding methods
On Windows, Notepad can save file content in different encodings.
ANSI is Notepad's default encoding: ASCII for English text and GB2312 for simplified Chinese text (on the Simplified Chinese version of Windows only; the Traditional Chinese version uses Big5).
Notepad also supports UTF-8. By saving the same file first as ANSI and then as UTF-8, we can observe the conversion between the two encodings. Use the hex mode of the text editor UltraEdit to inspect the byte values produced by each encoding.

9. How can the encoding of a text file be inferred when reading or writing it?

1. First, the program infers the encoding from the first few bytes of the file (the BOM):

ANSI: no BOM;
Unicode (little endian): the first two bytes are FF FE;
Unicode big endian: the first two bytes are FE FF;
UTF-8: the first three bytes are EF BB BF;

BOM of UTF-16 big endian: FE FF;
BOM of UTF-16 little endian: FF FE;
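
The BOM check in step 1 can be sketched in Python as follows. The function name sniff_bom and the returned codec names are choices made for this example. It covers only the three BOMs listed above; UTF-32, whose little-endian BOM also begins with FF FE, is deliberately left out.

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Return a codec name if the data starts with a known BOM, else None."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return None  # no BOM: fall back to content-based detection


print(sniff_bom(b"\xef\xbb\xbfabc"))  # utf-8-sig
print(sniff_bom(b"\xff\xfea\x00"))    # utf-16-le
print(sniff_bom(b"abc"))              # None
```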

2. If there is no BOM, infer the encoding from the content.

Determining UTF-8 from content

UTF-8 encoding rules:

Character byte length    Byte pattern

One byte                 0xxxxxxx

Two bytes                110xxxxx 10xxxxxx

Three bytes              1110xxxx 10xxxxxx 10xxxxxx

Four bytes               11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

** Data used to identify the lead byte **

Define an array bthead of length 4 holding the decimal values that the lead byte is compared against: 0, 192, 224, 240.

Define an array btbitandvalue of length 4 holding the decimal masks used to extract the length bits of the lead byte: 128, 224, 240, 248.

** Data used to identify continuation bytes **

Define a variable btvaluehead holding the decimal value of the continuation-byte marker: 128.

Define a variable btfixvalueand holding the decimal mask used to extract the continuation-byte marker: 192.

A. Read the file content as bytes and save it in a byte array.

B. Loop over the bytes that were read.

First, bitwise AND the current byte with each of the four values in btbitandvalue in turn, and compare each result with the corresponding value in bthead. When an equal value is found, the position of the match gives the character's byte length L. The L-1 bytes that follow are checked in step C and are skipped by step B; the next iteration of B starts after them.

C. Check the continuation marker. Bitwise AND the byte with btfixvalueand and compare the result with btvaluehead. If they are equal, repeat step C on the next byte until it has run L-1 times. If they are not equal, the content is not UTF-8.
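
Steps A to C can be sketched in Python as follows. The constant names mirror bthead, btbitandvalue, btvaluehead and btfixvalueand from the text; the function name looks_like_utf8 is an assumption for this example. The check is structural only, as in the description above: it does not reject overlong forms or reserved code points, so a strict decoder such as bytes.decode("utf-8") may still refuse input that passes.

```python
BT_BIT_AND_VALUE = (128, 224, 240, 248)  # masks applied to the lead byte
BT_HEAD = (0, 192, 224, 240)             # expected results for lengths 1 to 4
BT_FIX_VALUE_AND = 192                   # mask applied to continuation bytes
BT_VALUE_HEAD = 128                      # continuation bytes look like 10xxxxxx


def looks_like_utf8(data: bytes) -> bool:
    """Check whether every byte fits the UTF-8 lead/continuation patterns."""
    i = 0
    while i < len(data):
        # Step B: find the character length L from the lead byte.
        length = 0
        for k in range(4):
            if (data[i] & BT_BIT_AND_VALUE[k]) == BT_HEAD[k]:
                length = k + 1
                break
        if length == 0 or i + length > len(data):
            return False  # invalid lead byte or truncated character
        # Step C: the next L-1 bytes must carry the continuation marker.
        for j in range(i + 1, i + length):
            if (data[j] & BT_FIX_VALUE_AND) != BT_VALUE_HEAD:
                return False
        i += length  # skip the bytes already checked
    return True


print(looks_like_utf8("héllo €".encode("utf-8")))  # True
print(looks_like_utf8("中".encode("gb2312")))       # False
```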

UTF-16 can be determined in a similar way, once its encoding rules are known.


This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
