[Repost] A Chinese character in UTF-8 occupies three bytes: on UTF-8 byte lengths.
The answer on Baidu Zhidao is vivid and easy to remember, so I am noting it down here.
Original link: https://zhidao.baidu.com/question/1047887004693001899.html
Zhihu also has a clear answer: https://www.zhihu.com/question/23374078
1. Americans encoded their English characters first: the earliest ASCII code uses the low 7 bits of a byte to represent 128 characters, with the high bit always 0;
2. Later, Europeans found that 128 code points were nowhere near enough for them. For example, the letters of French carry accent marks — how to tell those apart? Simple: press the high bit into service as well. So Europe generally used a full byte for encoding, which can represent at most 256 characters. Europeans and Americans like to keep things direct: few characters, few bits;
3. Even with so few code points, different countries and regions used different character encodings. The values 0-127 represented the same symbols everywhere, but the interpretation of 128-255 was a complete mess: even when the binary values were identical, the characters were entirely different. For example, the byte value 135 maps to completely different characters in the French, Hebrew, and Russian encodings;
4. What's even more troublesome: after computer technology reached China, the Chinese found that we have on the order of 100,000 Chinese characters, and the 256 code points used in Europe and America don't even scratch the surface. So we invented the GB2312 Chinese character encoding, which typically uses two bytes to represent the vast majority of commonly used Chinese characters — two bytes can represent at most 65,536 of them. This is why you can find some rare Chinese characters in the Xinhua Dictionary but cannot display them on a computer without special handling.
5. How to unify character encodings worldwide? When a Russian sends an email to a Chinese person and the two sides use different character sets, the result is garbled text. To unify things, Unicode was invented: it collects all the symbols in the world and gives each one a unique code. Unicode can now hold more than a million symbols, each with a distinct code, so everything is unified — all languages can interoperate, and a single web page can display the texts of different countries at the same time.
6. However, although Unicode unifies the binary code of every character in the world, it says nothing about how that code should be stored. Different CPU architectures already disagree on little-endian versus big-endian byte order, never mind how a computer would tell Unicode apart from plain ASCII. And if Unicode mandated that every symbol take three or four bytes, then every English letter would be preceded by two or three zero bytes, making text files two or three times larger — a huge waste of storage. The consequence: there are multiple ways to store Unicode.
7. With the rise of the Internet, web pages had to display all kinds of characters, and UTF-8 became the most important implementation of Unicode (others include UTF-16, UTF-32, and so on). UTF-8 is not a fixed-length encoding but a variable-length one: it uses one to four bytes per symbol, with the byte length varying by symbol. The design is clever: if a byte's first bit is 0, that byte is a single-byte character on its own; if the first bit is 1, the number of consecutive leading 1s equals the number of bytes the current character occupies.
8. Note that a character's Unicode code point and its UTF-8 storage encoding are different things. For example, the Unicode code point of the character 严 ("strict") is 4E25, while its UTF-8 encoding is E4B8A5. As point 7 explains, UTF-8 considers not only the code but also the storage: E4B8A5 is 4E25 with the storage identification bits layered on top.
9. UTF-8 encodes each character in one to four bytes. The 128 ASCII characters (Unicode U+0000 to U+007F) need only one byte; Latin letters with diacritics, plus Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, and Thaana (Maldivian) characters, need two bytes (Unicode U+0080 to U+07FF); the remaining characters of the Basic Multilingual Plane (BMP) use three bytes (CJK belongs to this class — Qieqie's note); and characters in the supplementary Unicode planes use four bytes.
10. Finally, to answer the question: a Chinese character in UTF-8 generally occupies three bytes, and the most common encoding pattern is 1110xxxx 10xxxxxx 10xxxxxx. (A quick way to check these byte counts yourself is sketched right after this list.)
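As a quick sanity check on points 7, 9, and 10, here is a minimal Python 3 sketch (the sample characters are arbitrary) using the built-in str.encode(); it prints each character's code point, its UTF-8 bytes in hex, and the first byte in binary so the leading-1s rule is visible:

    # Expected byte counts: ASCII 'A' -> 1, accented Latin 'é' -> 2,
    # CJK '严' -> 3, supplementary-plane '😀' -> 4.
    for ch in "Aé严😀":
        utf8 = ch.encode("utf-8")
        print("%s  U+%04X  %-8s  %d byte(s)  first byte %s"
              % (ch, ord(ch), utf8.hex(), len(utf8), format(utf8[0], "08b")))

For 严 this prints U+4E25, e4b8a5, 3 bytes, and a first byte of 11100100 — three leading 1s, three bytes, exactly the rule from point 7.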
Unicode symbol range  | UTF-8 encoding method
(hexadecimal)         | (binary)
----------------------+-------------------------------------
0000 0000 - 0000 007F | 0xxxxxxx
0000 0080 - 0000 07FF | 110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
According to the table above, interpreting UTF-8 is very simple: if a byte's first bit is 0, the byte is a single character by itself; if the first bit is 1, the number of consecutive leading 1s equals the number of bytes the current character occupies. Next, take the Chinese character 严 as an example of how UTF-8 encoding works in practice. We know the Unicode code point of 严 is 4E25 (binary 100111000100101). From the table, 4E25 falls in the range of the third row (0000 0800 - 0000 FFFF), so the UTF-8 encoding of 严 needs three bytes in the format 1110xxxx 10xxxxxx 10xxxxxx. Then, starting from the last binary digit of 严 and working from back to front, fill the digits into the x positions, padding the remaining high x bits with 0. The result: the UTF-8 encoding of 严 is 11100100 10111000 10100101, which is E4B8A5 in hexadecimal.
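To make the fill-in procedure concrete, here is a minimal Python sketch of the same algorithm (the function name utf8_encode is just an illustrative choice); each branch follows one row of the table, OR-ing the code point's bit groups into the fixed header bits, and the result is checked against Python's built-in encoder:

    def utf8_encode(code_point):
        # Hand-rolled encoder following the table above.
        if code_point <= 0x7F:        # 0xxxxxxx
            return bytes([code_point])
        if code_point <= 0x7FF:       # 110xxxxx 10xxxxxx
            return bytes([0xC0 | (code_point >> 6),
                          0x80 | (code_point & 0x3F)])
        if code_point <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (code_point >> 12),
                          0x80 | ((code_point >> 6) & 0x3F),
                          0x80 | (code_point & 0x3F)])
        if code_point <= 0x10FFFF:    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return bytes([0xF0 | (code_point >> 18),
                          0x80 | ((code_point >> 12) & 0x3F),
                          0x80 | ((code_point >> 6) & 0x3F),
                          0x80 | (code_point & 0x3F)])
        raise ValueError("code point outside the Unicode range")

    # 严 is U+4E25; the sketch and Python's built-in encoder agree.
    assert utf8_encode(0x4E25) == "严".encode("utf-8")
    print(utf8_encode(0x4E25).hex())   # prints: e4b8a5

For brevity, the sketch skips the check that real encoders perform to reject the surrogate code points U+D800 to U+DFFF, which are not valid in UTF-8.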