Chatting about characters, character sets, and encoding (II)

Source: Internet
Author: User

In the first part of this series, I gave a general introduction to what character sets are and to the history of several character sets related to Chinese characters. In this part, I will talk about the two things mentioned at the end of the previous article: UCS and UTF.

A word of warning first: information about Unicode is scattered across the Internet. Some of it is widely circulated but outdated, some documents do not say when they were written, and even fairly authoritative sources such as Wikipedia give ambiguous or conflicting explanations of certain terms. What I write here is therefore a mixture of material from the web and my own understanding, and, in keeping with the word "chatting" in the title, I will not cite the source of every piece of information.

UCS is usually said to stand for Universal Character Set. Some sources say it stands for Unicode Character Set, and still others say that the official name is Universal Multiple-Octet Coded Character Set, which is also abbreviated UCS. By this point I am almost dizzy, so corrections are welcome. I personally prefer the first reading, because everything I have found suggests that UCS is a term closely tied to ISO 10646 and is rarely used in Unicode-related contexts; the "Unicode Character Set" expansion may simply be a misuse in articles that are not written very rigorously.

UCS comes in two forms, UCS-2 and UCS-4. The scheme that encodes each UCS character in two bytes is called UCS-2; likewise, the scheme that uses four bytes is called UCS-4. Both encodings are fixed-width: once the number of characters is known, the number of bytes needed to encode them is determined, simply by multiplying by two or by four. In UCS there may be unassigned code points (code points to which no character has been assigned yet), but there are no unusable ones. It follows that UCS-2 can represent at most 65,536 characters.
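The fixed-width idea can be sketched in a few lines of Python. This is only an illustration: the helper names `ucs2_encode` and `ucs4_encode` are made up for this example, and big-endian byte order is an assumption chosen for readability, not something the article specifies.

```python
import struct

def ucs2_encode(text):
    """Hypothetical helper: exactly two bytes per character (big-endian assumed)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # A two-byte unit cannot hold values above 65535.
            raise ValueError(f"U+{cp:04X} does not fit in UCS-2")
        out += struct.pack(">H", cp)
    return bytes(out)

def ucs4_encode(text):
    """Hypothetical helper: exactly four bytes per character (big-endian assumed)."""
    out = bytearray()
    for ch in text:
        out += struct.pack(">I", ord(ch))
    return bytes(out)

text = "A\u4e2d"                   # two characters: "A" and U+4E2D
print(len(ucs2_encode(text)))      # 4  (2 characters x 2 bytes)
print(len(ucs4_encode(text)))      # 8  (2 characters x 4 bytes)
```

Passing a character above U+FFFF, such as "\U0001F600", to `ucs2_encode` raises `ValueError`, which is the 65,536 limit in practice.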

As I remember it, when I first met Unicode on Windows it was presented as a 16-bit code with 65,536 positions, enough to hold nearly all the characters in the world. This idea was imprinted so deeply that even later, when I learned that Chinese characters alone number more than 70,000, it never occurred to me that two bytes might not be enough to encode every Unicode character. According to the Unicode 1.1 material I just looked up, however, code points were then defined only up to U+FFFD, which is still within the 65,536 range, so two bytes really were enough at the time. My mistake, then, was failing to keep up with the times: I did not know how Unicode developed afterwards.

If you only care about which code point a character has in the Unicode character set, you do not need to think about how many bytes are used to represent that value. But as soon as you want to handle it in a program, store it, or transmit it, you must decide how much space a code point occupies. It is like being handed the number 1 in C: do you store it in a char, a short, an int, or even a long? For this, the Unicode specification provides three official schemes: UTF-8, UTF-16, and UTF-32. The number after "UTF" is the size in bits of the smallest encoding unit in each scheme; divide it by 8 to get the number of bytes. In other words, UTF-8 uses at least 1 byte to encode a Unicode character, UTF-16 at least 2, and UTF-32 at least 4.

Note that this is a minimum, not a cap. So what is the maximum? For reasons of space I will not give the detailed encoding rules of each scheme, only the conclusions: in UTF-8 a character takes 1 to 4 bytes (characters within the two-byte range need at most 3); in UTF-16 a character takes either 2 or 4 bytes (never 3); in UTF-32 the length is fixed at 4 bytes. Readers with a solid computing background will be on the alert here, because any unit wider than one byte raises the question of byte order. That is right. UTF-16 therefore comes in two forms, UTF-16LE and UTF-16BE, corresponding to little-endian and big-endian byte order, and the same goes for UTF-32. If byte order is unfamiliar to you, refer to the article on byte order.

For the great majority of the characters that two bytes can represent, UTF-16 and UCS-2 produce the same encoding. That majority is 96.9%: of the 65,536 two-byte values, UTF-16 sets aside 2,048 as surrogates, leaving 63,488. The advantage of UTF-16 is that it can still encode characters beyond the two-byte range, simply by making the encoding longer, whereas UCS-2 is powerless there.
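The length and byte-order behavior of the three UTF schemes can be checked with Python's built-in codecs. The sketch below is illustrative only: the sample characters and the helper name `encoded_lengths` are choices made for this example, and the `-be` codec variants are used so that no byte order mark is added to the output.

```python
# Sample characters, written as escapes so the source stays ASCII.
SAMPLES = {
    "LATIN CAPITAL A": "\u0041",
    "LATIN SMALL E WITH ACUTE": "\u00e9",
    "CJK IDEOGRAPH U+4E2D": "\u4e2d",
    "EMOJI U+1F600 (beyond the two-byte range)": "\U0001F600",
}

def encoded_lengths(ch):
    """Byte lengths of one character in UTF-8, UTF-16, and UTF-32."""
    return tuple(len(ch.encode(codec))
                 for codec in ("utf-8", "utf-16-be", "utf-32-be"))

for name, ch in SAMPLES.items():
    print(f"U+{ord(ch):04X}", encoded_lengths(ch), name)

# Byte order changes the arrangement of the bytes, not their number.
print("\u4e2d".encode("utf-16-be").hex())  # 4e2d
print("\u4e2d".encode("utf-16-le").hex())  # 2d4e
```

The four samples give lengths of (1, 2, 4), (2, 2, 4), (3, 2, 4), and (4, 4, 4): UTF-8 grows from 1 to 4 bytes, UTF-16 jumps from 2 to 4 once the character leaves the two-byte range, and UTF-32 stays at 4.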

Once four bytes are used to represent a character, harmony finally arrives: UCS-4 and UTF-32 become the same thing, with no difference between them, each an alias for the other. ISO and the Unicode organization have even reached an agreement on how code points will be assigned to new characters in the future, to ensure mutual compatibility and long-term stability.
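A quick way to see this equivalence is to compare a code point written as a plain four-byte integer with the output of Python's UTF-32 codec. This is a small sketch, not a formal proof: `ucs4_be` is a hypothetical helper, and big-endian order is assumed on both sides.

```python
import struct

def ucs4_be(code_point):
    """Hypothetical helper: the code point as a 4-byte big-endian integer."""
    return struct.pack(">I", code_point)

for cp in (0x41, 0x4E2D, 0x1F600):
    same = ucs4_be(cp) == chr(cp).encode("utf-32-be")
    print(f"U+{cp:04X}", ucs4_be(cp).hex(), same)
```

For each of the three sample code points the two byte sequences are identical.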

Finally, a side note: there have been other encoding schemes in history, such as UTF-1 and UTF-7 (and even UTF-9 and UTF-18, but be careful: those two were published as an April Fools' Day joke, so do not take them seriously the way some domestic media do with such things). Most of these are no longer current, and unless you need them you do not have to understand them in depth. Of course, if you want to train yourself as a computer historian, go ahead.

Two terms related to UTF are not explained in this article. One is BOM, short for byte order mark; the other is ZWNBSP, short for zero width no-break space. If you are interested, you can look them up yourself. They do not cause much ambiguity and are rather detailed, so they are outside the scope of this article.
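For readers who do want a quick look, the following Python sketch shows the byte order mark in action. Both terms refer to the same code point, U+FEFF: at the start of a stream it acts as the BOM, and its character name is ZERO WIDTH NO-BREAK SPACE. The helper `utf16_byte_order` is hypothetical and deliberately simple; it only inspects UTF-16 data and does not try to tell UTF-16 from UTF-32.

```python
import codecs

def utf16_byte_order(data):
    """Hypothetical helper: guess the byte order of UTF-16 data from its BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little"
    return None  # no BOM: the byte order must be known some other way

# U+FEFF encoded in each byte order gives the two possible marks.
print("\ufeff".encode("utf-16-be").hex())  # feff
print("\ufeff".encode("utf-16-le").hex())  # fffe

data = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
print(utf16_byte_order(data))   # little
print(data.decode("utf-16"))    # A  (the generic "utf-16" codec consumes the BOM)
```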

That is the general introduction to UCS and UTF. This last article is not long, but it took a lot of effort, because deciding what to include and what to leave out was not easy. I hope it achieves my original goal: letting readers get a grasp of these concepts and terms in a short time. If it does, I will count myself lucky.

