ISO8859-1, UTF-8 and GB2312

Source: Internet
Author: User
Tags: control characters, ranges

ISO8859-1 is usually called Latin-1. It includes the additional characters that are indispensable for writing Western European languages. GB2312 is a standard Chinese character set.

For Chinese text, however, the ISO 10646 codes have the following problems. UTF-16 is built on 16-bit units (characters outside the Basic Multilingual Plane take two units), so for Chinese text it offers no size advantage over the two-byte Big5 or GB2312 codes. UTF-8, the variable-length encoding built on 8-bit units, uses three bytes for each common Chinese character. This means that an XML file of Chinese text encoded in UTF-8 is about 50% larger than the same file encoded in Big5. If the file uses ASCII markup, the difference is smaller, because each markup character takes one byte in either encoding and markup can account for about 50% of the file. Compressing the file is another way to reduce its size.

In ISO 10646, the character order differs from that of any Chinese code. You cannot use a simple algorithm to convert Big5 or GB2312 to ISO 10646; you must use a conversion table. On the other hand, the ordering of Chinese characters in ISO 10646 is helpful for sorting, and the removal of duplicate characters is also helpful for searching. (It is said that the GBK character set contains all the Chinese characters of ISO 10646 and keeps the same ordering as GB2312 for the characters they share. In some cases, it may be a good character set to use.)
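The size comparison above can be checked directly. The following is a minimal sketch in Python; the sample string and the `<p>` markup are assumptions chosen only for illustration, and `encoded_sizes` is a hypothetical helper, not part of any standard.

```python
# Compare the byte length of the same text in GB2312, Big5, and UTF-8.
text = "中文"  # two characters that exist in GB2312, Big5, and Unicode


def encoded_sizes(s):
    """Return the byte length of s in each encoding."""
    return {name: len(s.encode(name)) for name in ("gb2312", "big5", "utf-8")}


# Pure Chinese text: 2 bytes per character in GB2312/Big5, 3 in UTF-8.
plain = encoded_sizes(text)

# With ASCII markup, the markup bytes cost the same in every encoding,
# so the relative overhead of UTF-8 shrinks.
marked_up = encoded_sizes("<p>" + text + "</p>")

print(plain)
print(marked_up)
```

For the pure Chinese string, UTF-8 needs 6 bytes against 4 for the two-byte codes (50% more). With the seven markup bytes added, it is 13 against 11.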

UTF-8 is a variable-length encoding of Unicode, specified in RFC 3629. To put it simply, Unicode is one big character set, and UTF-8 is a way of storing it as bytes. It solves the problem of displaying text in multiple languages, which makes the internationalization and localization of applications possible. For the system, UTF-8 can be read and written quickly with bit masks and shift operations, and sorting is easier because comparing the encoded bytes gives the same order as comparing code points. UTF-8 is independent of byte order: its byte sequence is the same on all systems. UTF-8 is widely used in Web pages. As a Unicode encoding, it is committed to incorporating all the languages of the world into one unified code, and it already includes the important Asian scripts, including Simplified Chinese, Japanese, and Korean. UTF-8 is therefore the more general choice. However, if the text is only in English, any of these encodings will do, and there is no problem with using GB2312.
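As an illustration of the bit masking and shifting mentioned above, here is a minimal Python sketch that decodes one three-byte UTF-8 sequence by hand. The function name `decode_utf8_3byte` is hypothetical, and the sketch skips the checks a full decoder needs (overlong forms, surrogates, other sequence lengths).

```python
def decode_utf8_3byte(b: bytes) -> int:
    """Decode exactly one three-byte UTF-8 sequence into a code point.

    Layout: 1110xxxx 10xxxxxx 10xxxxxx (16 payload bits in total).
    """
    b0, b1, b2 = b  # unpacking fails with ValueError unless len(b) == 3
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a three-byte UTF-8 sequence")
    # Mask off the marker bits, then shift each payload into place.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


encoded = "李".encode("utf-8")        # the bytes E6 9D 8E
code_point = decode_utf8_3byte(encoded)
print(hex(code_point))
```

Because the sequence is defined byte by byte, the same three bytes are produced on big-endian and little-endian machines alike.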

GB2312 is the code for Simplified Chinese. When an article or web page contains Traditional Chinese, Japanese, or Korean, the content may not be encoded correctly. GB2312 Chinese characters are double-byte characters: each character occupies two bytes (that is, 16 bits), which are called the high byte and the low byte. GB2312 is the mandatory national standard for Chinese character encoding, and currently almost all applications that can process Chinese characters support it. GB2312 includes the level 1 and level 2 Chinese characters and the symbols in zones 1 to 9. The high byte ranges from 0xA1 to 0xFE and the low byte ranges from 0xA1 to 0xFE (high bytes above 0xF7 are unassigned). The encoding range of Chinese characters is 0xB0A1 to 0xF7FE.
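The byte ranges above translate into a simple range test. This is a sketch only; `is_gb2312_hanzi` is a hypothetical helper that checks the byte ranges quoted in the text and does not verify that a code is actually assigned.

```python
def is_gb2312_hanzi(pair: bytes) -> bool:
    """Return True if a two-byte code lies in the Chinese-character area.

    Uses the ranges from the text: high byte 0xB0-0xF7, low byte 0xA1-0xFE.
    """
    if len(pair) != 2:
        return False
    high, low = pair
    return 0xB0 <= high <= 0xF7 and 0xA1 <= low <= 0xFE


li = "李".encode("gb2312")           # high byte 0xC0, low byte 0xEE
symbol = b"\xa1\xa2"                 # a code in zone 1, the symbol area

print(li.hex(), is_gb2312_hanzi(li))
print(symbol.hex(), is_gb2312_hanzi(symbol))
```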

 

Summary:

Unicode is a massive character set jointly determined by many companies through the Unicode Consortium, with the participation of Asian companies such as Fuji and Fuji Xerox. The organization uses the ISO 10646 character set and adds other information, such as standard character names and properties. Unicode contains all the characters in GB2312 and (possibly) all the characters in Big5. It also covers many other languages. ISO 10646 has several encoding forms: UTF-8 uses 8-bit units and UTF-16 uses 16-bit units.
When software refers to an encoding simply as "Unicode", it usually means UTF-16.
Unicode is better than Big5 and GB2312 because it contains far more characters.
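To make the 8-bit and 16-bit units concrete, the sketch below counts how many units each encoding form needs for a few sample characters. The sample characters and the helper `code_units` are assumptions chosen for illustration.

```python
samples = {
    "ascii": "A",
    "chinese": "李",
    "supplementary": "\U00020000",  # a character outside the 16-bit range
}


def code_units(ch: str) -> tuple:
    """Return (number of 8-bit UTF-8 units, number of 16-bit UTF-16 units)."""
    utf8_units = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2  # two bytes per unit
    return utf8_units, utf16_units


units = {name: code_units(ch) for name, ch in samples.items()}
print(units)
```

The last sample needs two 16-bit units in UTF-16, which shows that UTF-16 is not strictly fixed-length.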

ISO Character Set

The "A" in ASCII stands for "American", so it is not surprising that ASCII is designed mainly for writing English. ASCII contains no accented letters such as ü, and it lacks many other characters required to write other languages.

You can extend ASCII by assigning more characters to the codes from 128 upward. The International Organization for Standardization (ISO) defines several character sets that are based on ASCII and add the characters required by other languages and regions. The most prominent is ISO8859-1, usually called Latin-1. Latin-1 includes the additional characters necessary to write Western European languages. Characters 0 to 127 are the same as in ASCII. Table 7-2 lists characters 128 to 255. As in ASCII, the first 32 of these (128 to 159) are rarely used, non-printable control characters.
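These properties of Latin-1 can be verified with a short Python sketch. It relies on Python's `latin-1` codec, which maps every byte value 0 to 255 to the Unicode code point with the same number; the variable names are illustrative only.

```python
import unicodedata

# Bytes 0-127 decode to the same text under ASCII and Latin-1.
low_half = bytes(range(128))
same_as_ascii = low_half.decode("latin-1") == low_half.decode("ascii")

# Bytes 128-255 are the Latin-1 extension.
high_half = bytes(range(128, 256)).decode("latin-1")

# Count the control characters (Unicode category "Cc") in the upper half.
controls = [c for c in high_half if unicodedata.category(c) == "Cc"]

print(same_as_ascii)
print(len(controls), hex(ord(controls[0])), hex(ord(controls[-1])))
print(high_half[0xE9 - 128])  # byte 0xE9 in Latin-1
```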

So conversion between ISO8859-1 and GB2312 is troublesome, and Unicode is usually used as the intermediate step when converting between different character sets. Assume that there are two different character sets, A and B. The conversion procedure is: convert A to Unicode first, and then convert Unicode to B.

For example, the Chinese character 李 ("Li") is encoded as C0EE in GB2312. To convert it to ISO8859-1, first convert the character to Unicode, which gives 674E, and then convert 674E to an ISO8859-1 character. Of course, this mapping will not succeed, because ISO8859-1 has no character at all that corresponds to 674E.
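The two-step conversion in this example can be sketched in Python, where a `str` already holds Unicode code points and so serves as the intermediate form. The variable names are illustrative.

```python
# Step 1: GB2312 bytes -> Unicode.
gb_bytes = bytes.fromhex("C0EE")
char = gb_bytes.decode("gb2312")
code_point = ord(char)

# Step 2: Unicode -> ISO8859-1. This fails, because Latin-1 only
# covers code points 0x00-0xFF and 0x674E lies far outside that range.
try:
    char.encode("iso-8859-1")
    convertible = True
except UnicodeEncodeError:
    convertible = False

print(hex(code_point), convertible)
```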

In short, one is a character set for Chinese and the other is a character set for Western European languages.
------------------------------------------------------------------------------
ISO-8859-1 is the default character set used for network transmission in Java, while GB2312 is the standard Chinese character set. When you submit a form or perform another operation that requires network transmission, you need to convert the received ISO-8859-1 text to GB2312 before displaying it. Otherwise the browser interprets ISO-8859-1 data as GB2312, and because the two character sets are not compatible, the text is garbled.
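The garbling and its repair can be reproduced in a few lines. This is a Python sketch of the idea, not the Java code itself; in Java the same repair is commonly written as `new String(s.getBytes("ISO-8859-1"), "GB2312")`. The variable names are assumptions.

```python
original = "李"

# The sender transmits GB2312 bytes.
wire_bytes = original.encode("gb2312")

# The receiver wrongly decodes them as ISO-8859-1: each byte becomes
# one Latin-1 character, which is the garbled text the user sees.
misread = wire_bytes.decode("iso-8859-1")

# Repair: turn the misread text back into the original bytes
# (Latin-1 maps characters to bytes one to one), then decode as GB2312.
repaired = misread.encode("iso-8859-1").decode("gb2312")

print(repr(misread), repr(repaired))
```

The repair works because decoding as ISO-8859-1 never loses information: every byte value maps to exactly one character.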
------------------------------------------------------------------------------
A Chinese character is a double-byte character: it occupies two bytes (that is, 16 bits), which are called the high byte and the low byte. GB2312 is the mandatory national standard for Chinese character encoding, and currently almost all applications that can process Chinese characters support it. GB2312 includes the level 1 and level 2 Chinese characters and the symbols in zones 1 to 9. The high byte ranges from 0xA1 to 0xFE and the low byte ranges from 0xA1 to 0xFE. The encoding range of Chinese characters is 0xB0A1 to 0xF7FE.

There is another encoding called GBK, but it is a guideline specification rather than a mandatory standard. GBK provides 20902 Chinese characters. It is compatible with GB2312, and its encoding range is 0x8140 to 0xFEFE. Every character in GBK maps one-to-one to a character in Unicode 2.0.
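A short Python sketch can show both the compatibility with GB2312 and the wider range. The helper `in_gbk_range` is hypothetical. It checks only the overall range quoted above, with one added assumption taken from the GBK layout: the low byte is never 0x7F.

```python
def in_gbk_range(pair: bytes) -> bool:
    """Return True if a two-byte code lies inside 0x8140-0xFEFE.

    Assumption beyond the text: the low byte 0x7F is excluded.
    """
    if len(pair) != 2:
        return False
    high, low = pair
    return 0x81 <= high <= 0xFE and 0x40 <= low <= 0xFE and low != 0x7F


# A GB2312 character keeps the same bytes in GBK.
same_bytes = "李".encode("gb2312") == "李".encode("gbk")

# A Traditional Chinese character is missing from GB2312 but present in GBK.
traditional = "漢"
try:
    traditional.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False
gbk_code = traditional.encode("gbk")

print(same_bytes, in_gb2312, gbk_code.hex(), in_gbk_range(gbk_code))
```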

China has also defined a newer standard, GB18030-2000 (GBK2K). It contains the scripts of ethnic minorities, such as Tibetan and Mongolian, and fundamentally solves the problem of insufficient code space. Note that it is no longer fixed-length. The two-byte part is compatible with GBK, and the four-byte part holds the extended characters. In a four-byte code, the first and third bytes range from 0x81 to 0xFE, and the second and fourth bytes range from 0x30 to 0x39.
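Because the second byte of a four-byte code falls in 0x30 to 0x39, which is outside the GBK low-byte range, a decoder can tell the length of a code from its first two bytes. The following is a minimal sketch; `gb18030_length` is a hypothetical helper, and the single-byte ASCII case is an assumption not spelled out in the text above.

```python
def gb18030_length(data: bytes) -> int:
    """Return the length (1, 2, or 4) of the code that starts data."""
    first = data[0]
    if first <= 0x7F:
        return 1                      # single byte, same as ASCII
    if not 0x81 <= first <= 0xFE or len(data) < 2:
        raise ValueError("invalid or truncated GB18030 sequence")
    # 0x30-0x39 as second byte marks a four-byte code; anything else
    # is treated here as the second byte of a GBK-compatible pair.
    return 4 if 0x30 <= data[1] <= 0x39 else 2


tibetan = "\u0f40".encode("gb18030")  # a Tibetan letter, not in GBK
chinese = "李".encode("gb18030")       # same two bytes as in GBK

print(tibetan.hex(), gb18030_length(tibetan))
print(chinese.hex(), gb18030_length(chinese))
```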

 
