Introduction to Character Set Encoding in Java (I): The Historical Feud Between Unicode and UCS

ASCII and related standards

Everyone on Earth knows that ASCII stands for American Standard Code for Information Interchange, and that it uses 7-bit binary numbers to represent English characters. After ASCII was adopted as an international standard, it was designated ISO 646. A byte can represent 256 values, but because ASCII is a 7-bit code it uses only the 128 code points from 0 to 127. The remaining 128 code points are available for extension, to represent characters specific to particular languages. Different ways of filling those spare 128 code points produced the ISO-8859-* family of standards. For example, the extension for Western European languages is ISO-8859-1, also known as Latin-1, and the extension for Greek is ISO-8859-7; for the complete list, see the book "Java Internationalization".
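As a small illustration of the paragraph above, the following sketch decodes the same byte values under ISO-8859-1 and ISO-8859-7. The class name `Iso8859Demo` and the helper `decodeByte` are made up for this example, and it assumes the running JDK ships the ISO-8859-7 charset (most full JDK builds do, but only a handful of charsets are guaranteed on every runtime).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Iso8859Demo {
    // Hypothetical helper: decode one byte under the given charset
    // and return the Unicode code point it maps to.
    static int decodeByte(int byteValue, Charset charset) {
        String s = new String(new byte[] {(byte) byteValue}, charset);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: ISO-8859-7 is installed; forName throws
        // UnsupportedCharsetException on a runtime that lacks it.
        Charset greek = Charset.forName("ISO-8859-7");

        // 0x41 lies in the shared 7-bit ASCII range; 0xE1 lies in the extended upper half.
        int[] samples = {0x41, 0xE1};
        for (int b : samples) {
            System.out.printf("byte 0x%02X -> Latin-1 U+%04X, Greek U+%04X%n",
                    b,
                    decodeByte(b, StandardCharsets.ISO_8859_1),
                    decodeByte(b, greek));
        }
    }
}
```

Bytes in the range 0 to 127 decode to the same character under both charsets, while a byte in the upper half, such as 0xE1, becomes a different character depending on which ISO-8859 part is assumed.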

Unicode and UCS

The Unicode project was started in the 1980s by a group of computer software companies, along with some companies from the publishing industry. Everyone on Earth knows that 256 code points are nowhere near enough for Japanese or Chinese characters (of course, not everyone on Earth knew it back then; at least the Americans designing computers did not, and even today some Americans think the United States is the only country in the world). The solution is intuitive and obvious: use a coding scheme with enough code points to hold every character needed (a blunt fix, as the saying goes). This was one of Unicode's goals: to contain the characters of every language in the world (Chinese characters, Japanese, mathematical symbols, musical symbols, and also all sorts of strange, unintelligible things such as hieroglyphs, oracle bone script, the Three Represents, the Scientific Outlook on Development, and so on; just kidding). It is a grand ideal, but it soon turned out to be unattainable under Unicode's original design.

Unicode's other design goal, which has had a far-reaching impact to this day, was to encode every character in 16 bits (that is, to number each character with an integer smaller than 2^16). Note that in this sense Unicode is a coded character set, not a character encoding. Calling this goal far-reaching is no compliment: the Unicode designers themselves later found that 16 bits give only 65,536 code points, far too few for all the world's characters. By the time they recognized the problem, most of the Unicode specification had been finalized and was already fairly widely adopted, so starting over was not realistic. This became a historical burden, and the origin of the clumsy workaround known as surrogate pairs.
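Java inherited the 16-bit design: a `char` is 16 bits wide, so a character outside the first 65,536 code points occupies two `char` values. The sketch below is only an illustration; the class name `SurrogateDemo` and the helper `fromCodePoint` are made up for it, and U+1D11E (the musical G clef) is an arbitrary sample character.

```java
final class SurrogateDemo {
    // Hypothetical helper: build a String holding exactly one code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String han = fromCodePoint(0x4E2D);   // a Chinese character that fits in 16 bits
        String clef = fromCodePoint(0x1D11E); // a musical symbol that does not

        // length() counts 16-bit char units, not characters.
        System.out.println(han.length());   // 1
        System.out.println(clef.length());  // 2

        // codePointCount() counts actual code points.
        System.out.println(clef.codePointCount(0, clef.length())); // 1

        // The two char units are a high surrogate followed by a low surrogate.
        System.out.printf("U+%04X U+%04X%n", (int) clef.charAt(0), (int) clef.charAt(1));
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));  // true
    }
}
```

Code that indexes a string by `char` therefore needs `codePointAt` or `codePoints()` when it must treat such characters as single units.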

Coincidentally, in 1984 the International Organization for Standardization (ISO), which likes to baffle the public with long strings of numbers, also began working on the problem of there being too many characters across different languages. Its solution is called the Universal Character Set (UCS), officially numbered ISO 10646 (remember that ASCII is ISO 646; I wonder whether the numbering is intentional). ISO had more foresight: from the outset it defined UCS as a 31-bit coded character set (that is, each character is numbered with an integer smaller than 2^31). That really is enough to hold every character of every language of every country in every era (yes, any country and any minority language, whether it has diplomatic ties with Taiwan or with China, whether it advocates democracy or terrorism; science really has no borders), although they later realized that 2^31 code points is far too many ...

As the saying goes, what has long been divided must unite. Both Unicode and UCS set out to eliminate the many incompatible private code extensions (well said), yet the result was that each side set its own standard (and initially the two were incompatible), creating yet another split. That is detrimental to building a harmonious society and runs against the main theme of peace and development in today's world; the Chinese government has always opposed hegemonism and power politics of every kind, and the developed countries led by America ... but I digress.

In 1991 the Unicode Consortium and the ISO working group finally began discussing a merger of Unicode and UCS. The merger took many years: many encodings in the first edition of the Unicode specification had to be rewritten, and UCS had to accept restrictions on its use of the code space. But the result is gratifying. In the end the two unified their abstract character sets (any character in Unicode also exists in UCS), and the first 65,536 code points also share the same character assignments. For the code space, both agreed on a limit of about 1.1 million: 65,536 is not enough and 2^31 is too big, while 1.1 million is a size both sides could accept and is also sufficient. The 1.1 million here is only approximate; the exact figure is 1,114,112 code points, U+0000 through U+10FFFF. Unicode extended its code space to that size, and UCS will never use code points beyond it. In other words, it is wrong to say that Unicode contains only 65,536 characters. Beyond unifying the characters already defined, the Unicode Consortium and the ISO working group agreed that all future extension work would be synchronized. So although Unicode and UCS are historically different things (and still differ in some details), when Unicode is mentioned today it is fine to take it as referring to both.
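The agreed code space can be checked against the constants Java exposes in `java.lang.Character`. This is a minimal sketch for the arithmetic only; the class name `CodeSpaceDemo` and its two helper methods are made up for the example.

```java
final class CodeSpaceDemo {
    // Hypothetical helper: number of code points in the agreed code space,
    // from U+0000 up to and including Character.MAX_CODE_POINT (U+10FFFF).
    static long codeSpaceSize() {
        return (long) Character.MAX_CODE_POINT + 1;
    }

    // Hypothetical helper: how many blocks of 65,536 code points
    // (the size of the original 16-bit design) fit into that space.
    static long blocksOf65536() {
        return codeSpaceSize() / 0x10000;
    }

    public static void main(String[] args) {
        System.out.println(codeSpaceSize());  // 1114112, the "1.1 million"
        System.out.println(blocksOf65536());  // 17
        System.out.println(1L << 16);         // 65536, the original 16-bit limit
        System.out.println(1L << 31);         // 2147483648, the original UCS limit

        // Anything past U+10FFFF is not a valid code point.
        System.out.println(Character.isValidCodePoint(0x10FFFF)); // true
        System.out.println(Character.isValidCodePoint(0x110000)); // false
    }
}
```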
