Reprinted -- JAVA character encoding Series 1: Unicode, GBK, GB2312, UTF-8 concepts

Source: Internet
Author: User

Address: http://blog.csdn.net/qinysong/archive/2006/09/05/1179480.aspx

Over the past two days I took the time to summarize how the various encodings are actually used in Java applications, and I am recording the results here for future reference. To build a complete, in-depth understanding of text encoding, and to deal with the problems that come up during Java development (garbled text in particular), I think it is best to describe and analyze the subject as a series of three articles:

Article 1: JAVA character encoding Series I: Unicode, GBK, GB2312, UTF-8 concept basics
Article 2: JAVA character encoding Series II: Unicode, ISO-8859, GBK, UTF-8 encoding and mutual conversion
Article 3: JAVA character encoding Series III: coding problems in Java applications

Article 1: JAVA character encoding Series I: Unicode, GBK, GB2312, UTF-8 concept basics

This part reuses existing material: I am reposting an article that serves the purpose.

Source: holen's blog, "Character encoding: understanding Unicode, ISO 10646, UCS, UTF8, UTF16, GBK, GB2312"
Address: http://blog.donews.com/holen/archive/2004/11/30/188182.aspx

Unicode:

 

The encoding scheme developed by unicode.org, intended to cover the scripts in common use all over the world.
In version 1.0 it was a 16-bit code, from U+0000 to U+FFFF, with each 2-byte code corresponding to one character. Starting with version 2.0 the 16-bit limit was abandoned: the original 16-bit range became the basic plane, and 16 more planes were added. The result is roughly a 21-bit code (often loosely described as 20-bit) with a range of 0 to 0x10FFFF.
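The plane structure described above can be checked from Java. The following is a minimal sketch, not part of the original article; the class name `PlaneSketch` and the helper `plane` are hypothetical names chosen for illustration. It only relies on `Character.MAX_CODE_POINT` and `Character.isValidCodePoint` from the standard library.

```java
class PlaneSketch {
    // Plane number of a code point: the bits above the low 16 bits.
    // (Hypothetical helper, named here only for illustration.)
    static int plane(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        // The largest code point is 0x10FFFF, which lies in plane 16,
        // so there are 17 planes in total: the basic plane plus 16 more.
        System.out.printf("max code point: U+%X%n", Character.MAX_CODE_POINT);
        System.out.println("number of planes: " + (plane(Character.MAX_CODE_POINT) + 1));
        System.out.println("plane of U+4E2D: " + plane(0x4E2D));
        System.out.println("plane of U+1F600: " + plane(0x1F600));
    }
}
```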

UCS:

The Universal Character Set defined by ISO in ISO 10646, which uses a 4-byte encoding.

Unicode and UCS:

ISO and unicode.org are two different organizations, so they initially developed different standards. Since Unicode 2.0, however, Unicode has used the same character repertoire and code assignments as ISO 10646-1, and ISO has promised that ISO 10646 will never assign a UCS-4 code above 0x10FFFF, so the two are consistent.

The encoding method of UCS:

 

  • UCS-2, which is essentially the same as the 2-byte Unicode encoding.
  • UCS-4, a 4-byte encoding; at present it is the UCS-2 code with two all-zero bytes added in front.

     

    UTF: Unicode/UCS Transformation Format

  • UTF-8, an encoding built from 8-bit units. ASCII is unchanged; other characters use a variable-length encoding of 1 to 3 bytes per BMP character (4 bytes for characters beyond the BMP). It is usually used as an external code and has the following advantages:
    * It is independent of the CPU byte order, so data can be exchanged between different platforms.
    * High fault tolerance. If a byte is damaged, at most one character is lost and the error does not cascade (in some other encodings a single bad byte can garble the rest of the line).
  • UTF-16, an encoding built from 16-bit units. It is variable-length (one unit for BMP characters, two for the rest), covers the values 0 to 0x10FFFF, and is essentially the direct implementation of the Unicode code. It depends on the CPU byte order, but because it is compact for text dominated by non-ASCII BMP characters, it is often used as an external code for network transmission.
    UTF-16 is the preferred encoding of Unicode.
  • UTF-32, a 32-bit encoding restricted to the Unicode range (0 to 0x10FFFF); it is equivalent to a subset of UCS-4.
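To make the size trade-offs in the list above concrete, here is a small Java sketch that counts the bytes each sample character needs in UTF-8 and in UTF-16. The class and method names are hypothetical; `StandardCharsets.UTF_16BE` is used so that no byte order mark is added to the output.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizeSketch {
    // Number of bytes the string occupies in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies in big-endian UTF-16 (no BOM).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                     // ASCII
            "\u00E9",                                // Latin-1 letter
            "\u4E2D",                                // CJK ideograph in the BMP
            new String(Character.toChars(0x1F600))   // character beyond the BMP
        };
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8 %d bytes, UTF-16 %d bytes%n",
                    s.codePointAt(0), utf8Length(s), utf16Length(s));
        }
    }
}
```

ASCII text is smallest in UTF-8, while CJK text in the BMP is smaller in UTF-16 (2 bytes per character instead of 3).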

     

    UTF and Unicode:

    Unicode is a character set and can be regarded as an internal code.
    UTF is an encoding method, needed because raw Unicode values are not suitable for direct transmission and processing in some scenarios. UTF-16 encodes Unicode values directly, with no transformation, so the encoded data contains 0x00 bytes: for the first 256 code points the high byte is 0x00, which has a special meaning (the string terminator) in the operating system and the C language and can cause problems. Transforming Unicode with UTF-8 avoids this problem and brings some further advantages.

    Chinese national standard code:

  • GB 13000: equivalent to ISO 10646-1/Unicode 2.1. It will be revised in step with future changes to ISO 10646/Unicode.
  • GBK: an extension of GB2312 that adds the unified Han characters of Unicode 2.1 lying outside the GB2312 character set, plus some characters not included in Unicode.
  • GB 18030-2000: based on GB 13000, it extends GBK for Unicode 3.0 and covers all Unicode code points; like UTF-8 and UTF-16, it is an encoding form of Unicode. It is a variable-length encoding in which a character takes one, two, or four bytes. GB 18030 is backward compatible with GB2312/GBK.
    GB 18030 is mandatory for all non-handheld, non-embedded computer systems in China.
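The backward compatibility between GB2312, GBK, and GB 18030 can be observed by encoding the same character in each of them. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes the JVM ships the `GB2312`, `GBK`, and `GB18030` charsets. The Java specification does not guarantee that, so the code checks `Charset.isSupported` first.

```java
import java.nio.charset.Charset;

class ChineseCharsetSketch {
    // Hex dump of the text in the named charset, or null if this JVM
    // does not provide that charset (only a few charsets are guaranteed).
    static String hex(String text, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String zhong = "\u4E2D"; // the character "zhong" (middle), U+4E2D
        for (String name : new String[] { "GB2312", "GBK", "GB18030", "UTF-8" }) {
            String dump = hex(zhong, name);
            System.out.println(name + ": " + (dump == null ? "(charset not available)" : dump));
        }
    }
}
```

Where the three Chinese charsets are available, they produce the same two bytes for this character, while UTF-8 produces a different three-byte sequence.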

    -------------------------------

     

    What are UCS and ISO 10646?

    The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards. It guarantees round-trip compatibility with other character sets: if you convert any text string to UCS and then convert it back to the original encoding, no information is lost.

    UCS contains the characters needed to represent practically all known languages. This covers not only the Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and Georgian scripts, but also the Chinese, Japanese, and Korean ideographs, as well as scripts such as Hiragana, Katakana, Bengali, Gurmukhi (used for Punjabi), Kannada, Malayalam, Thai, Lao, Bopomofo, Hangul, Devanagari, Gujarati, Oriya, Telugu, and many others. Scripts that have not yet been added are still being studied to find the best way to encode them for computer use, and they will be added eventually. These include Tibetan, Khmer, Runic (the ancient Nordic script), Ethiopian, other hieroglyphic scripts, and a variety of Indo-European languages, as well as selected artistic scripts such as Tengwar, Cirth, and Klingon. UCS also covers a large number of graphical, typographical, mathematical, and scientific symbols, including all the characters provided by TeX, PostScript, MS-DOS, MS-Windows, Macintosh, and OCR fonts, as well as by many other word processing and publishing systems.

    ISO 10646 defines a 31-bit character set. In this huge code space, however, only the first 65534 code positions (0x0000 to 0xFFFD) have been assigned so far. This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP). Characters outside the 16-bit BMP are special characters (such as hieroglyphics) that are used only by experts in history and science. According to current plans, no character will ever be assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which leaves room for more than one million potential future characters. The ISO 10646-1 standard was first published in 1993 and defines the architecture of the character set and the content of the BMP. A second part, ISO 10646-2, which covers characters outside the BMP, is in preparation, but it may take several years to complete. New characters are still being added to the BMP, but the existing characters are stable and will not change.

    In addition to assigning a code to each character, UCS also gives each character an official name. A hexadecimal number that represents a UCS or Unicode value is usually prefixed with "U+", as in U+0041 for the character "Latin capital letter A". The UCS characters U+0000 to U+007F are identical to US-ASCII (ISO 646), and U+0000 to U+00FF are identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF, along with larger ranges outside the BMP, is reserved for private use.
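The correspondence between Latin-1 and the first 256 UCS values, and the official character names, can both be checked from Java. This is a sketch with hypothetical class and method names; it uses `StandardCharsets.ISO_8859_1` and `Character.getName`, which returns the Unicode name of a code point.

```java
import java.nio.charset.StandardCharsets;

class Latin1Sketch {
    // True if decoding every byte 0x00..0xFF as ISO 8859-1 yields the
    // character whose UCS value equals the byte value.
    static boolean latin1MatchesUcs() {
        for (int value = 0; value <= 0xFF; value++) {
            String decoded = new String(new byte[] { (byte) value }, StandardCharsets.ISO_8859_1);
            if (decoded.length() != 1 || decoded.charAt(0) != value) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("Latin-1 equals U+0000..U+00FF: " + latin1MatchesUcs());
        System.out.printf("U+%04X is named %s%n", 0x41, Character.getName(0x41));
    }
}
```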

    What are combining characters?

    Some code points in UCS are assigned to combining characters. They are similar to the non-spacing accent keys on a typewriter. A combining character is not a complete character by itself; it is an accent or other diacritical mark that is added to the preceding character. In this way, any accent can be placed on any character. The most important accented characters, such as those used in the orthographies of common languages, have their own positions in UCS to ensure backward compatibility with older character sets. Accented characters that have their own code position, but could also be represented as a base character followed by a combining character, are called precomposed characters. Precomposed characters exist in UCS for backward compatibility with older encodings, such as ISO 8859, that have no combining characters. The combining character mechanism lets you add accents or other marks to any character, which is particularly useful for scientific notation such as mathematical formulas and the International Phonetic Alphabet, where any combination of a base character and one or more marks may be needed.

    A combining character follows the character it modifies. For example, the German umlaut character Ä ("Latin capital letter A with diaeresis") can be represented by the precomposed UCS code U+00C4, or alternatively as a normal "Latin capital letter A" followed by a "combining diaeresis": the sequence U+0041 U+0308. Several combining characters can be used when you need to stack multiple accents or add marks both above and below a base character. In Thai text, for example, a base character can carry up to two combining characters.
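In Java, the precomposed form U+00C4 and the combined sequence U+0041 U+0308 are different strings until they are normalized. The sketch below uses `java.text.Normalizer`; normalization forms (NFC, NFD) come from the Unicode Standard and are not discussed in the text above, so treat this as a supplementary illustration. The class and method names are hypothetical.

```java
import java.text.Normalizer;

class CombiningSketch {
    // Compose base character + combining marks into precomposed characters where possible.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Decompose precomposed characters into base character + combining marks.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String precomposed = "\u00C4"; // Latin capital letter A with diaeresis
        String combined = "A\u0308";   // Latin capital letter A + combining diaeresis

        System.out.println("raw strings equal: " + precomposed.equals(combined));
        System.out.println("equal after NFC: " + compose(combined).equals(precomposed));
        System.out.println("equal after NFD: " + decompose(precomposed).equals(combined));
    }
}
```

Comparing user-visible text without normalizing first is a common source of "identical-looking but unequal" strings.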

    What are the UCS implementation levels?

    Not all systems need to support all advanced mechanisms in the UCS such as composite characters. Therefore, ISO 10646 specifies the following three implementation levels:

    Level 1
    Combining characters and Hangul Jamo characters (a special, more complex encoding of Korean in which a syllable is encoded as two or three sub-characters) are not supported.
    Level 2
    Like level 1, except that for some scripts a fixed list of combining characters is allowed (for example, for Hebrew, Arabic, Devanagari, Bengali, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, and Lao). Without at least these combining characters, UCS cannot adequately represent these scripts.
    Level 3
    All of UCS is supported. For example, a mathematician can put a tilde or an arrow (or both) on any character.

    What is Unicode?

    Historically, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO); the other was the Unicode project, organized by a consortium of (mostly American) manufacturers of multilingual software. Fortunately, around 1991, the participants of both projects realized that the world does not need two different unified character sets. They merged their work and cooperated on a single code table. Both projects still exist and publish their standards independently, but the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of Unicode and ISO 10646 compatible and to coordinate closely on any future extensions.

    So what is the difference between Unicode and ISO 10646?

    The Unicode Standard published by the Unicode Consortium corresponds exactly to the Basic Multilingual Plane of ISO 10646-1 at implementation level 3. In both standards, all characters are at the same positions and have the same names.

    The Unicode Standard additionally defines much more of the semantics associated with characters, and it is generally the better reference for implementers of high-quality typographic and publishing systems. Unicode specifies in detail the algorithms for rendering the presentation forms of some scripts (such as Arabic), for handling bidirectional text (such as mixed Latin and Hebrew), for sorting and string comparison, and much more.

    The ISO 10646 standard, on the other hand, is little more than a simple character set table, like the well-known ISO 8859 standard. It defines some terminology related to the standard, defines some encoding alternatives, and specifies how to use UCS together with other ISO standards such as ISO 6429 and ISO 2022. There are also other closely related ISO standards; for example, ISO 14651 deals with sorting UCS strings.

    Since the Unicode Standard has an easy-to-remember name, is available from Addison-Wesley in any good bookstore at a small fraction of the price of the ISO version, and contains more supporting information, it is not surprising that it has become the more widely used reference. However, it is generally thought that the fonts used to print the ISO 10646-1 standard are of higher quality than those used to print Unicode 2.0. Professional font designers are advised to consult both standards, as some of the sample glyphs differ significantly. The ISO 10646-1 standard also shows four different style variants for the Chinese, Japanese, and Korean (CJK) ideographs, while the Unicode 2.0 tables show only the Chinese variants. This has led to the widespread belief that Unicode is unacceptable to Japanese users, which is incorrect.

    What is UTF-8?

    First of all, UCS and Unicode are only code tables that assign an integer to each character. There are several ways to represent a sequence of such characters as a sequence of bytes. The two most obvious methods store Unicode text as sequences of 2-byte or 4-byte units. The official names of these two encodings are UCS-2 and UCS-4, respectively. Unless otherwise specified, the most significant byte comes first (the big-endian convention). To convert an ASCII or Latin-1 file to UCS-2, simply insert a 0x00 byte before each ASCII byte. To convert to UCS-4, insert three 0x00 bytes before each ASCII byte instead.
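The "insert a 0x00 before each byte" rule can be written out directly and compared with the JDK's big-endian UTF-16 encoder, which produces the same bytes as UCS-2 for BMP characters. This is a sketch; the class name and the `widen` helper are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiToUcs2Sketch {
    // Convert ASCII/Latin-1 bytes to big-endian UCS-2 by putting
    // a 0x00 byte in front of every input byte.
    static byte[] widen(byte[] latin1) {
        byte[] out = new byte[latin1.length * 2];
        for (int i = 0; i < latin1.length; i++) {
            out[2 * i] = 0x00;          // high byte is always zero for Latin-1
            out[2 * i + 1] = latin1[i]; // low byte is the original byte
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hi";
        byte[] manual = widen(text.getBytes(StandardCharsets.ISO_8859_1));
        byte[] library = text.getBytes(StandardCharsets.UTF_16BE);
        System.out.println("manual:  " + Arrays.toString(manual));
        System.out.println("library: " + Arrays.toString(library));
        System.out.println("same bytes: " + Arrays.equals(manual, library));
    }
}
```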

    Using UCS-2 (or UCS-4) under Unix would cause very serious problems. Strings in these encodings can contain bytes such as '\0' or '/' as parts of wide characters, and these bytes have a special meaning in file names and in the parameters of other C library functions. In addition, most UNIX tools that expect ASCII files cannot read 16-bit words as characters without major changes. For these reasons, UCS-2 is not a suitable external encoding of Unicode in file names, text files, environment variables, and similar places.
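The problem with special bytes can be shown with a single character. In the sketch below (hypothetical class and method names), U+012F (Latin small letter i with ogonek) is used as the sample because its low byte happens to be 0x2F, the ASCII code of '/'. UTF-16BE stands in for big-endian UCS-2, which is identical for BMP characters.

```java
import java.nio.charset.StandardCharsets;

class SpecialByteSketch {
    // True if the byte array contains a byte with the given unsigned value.
    static boolean containsByte(byte[] data, int value) {
        for (byte b : data) {
            if ((b & 0xFF) == value) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String sample = "\u012F";
        byte[] ucs2 = sample.getBytes(StandardCharsets.UTF_16BE); // 0x01 0x2F
        byte[] utf8 = sample.getBytes(StandardCharsets.UTF_8);    // 0xC4 0xAF

        System.out.println("UCS-2 contains '/': " + containsByte(ucs2, '/'));
        System.out.println("UTF-8 contains '/': " + containsByte(utf8, '/'));
        // Plain ASCII text in UCS-2 contains NUL bytes, which end a C string.
        System.out.println("UCS-2 \"A\" contains NUL: "
                + containsByte("A".getBytes(StandardCharsets.UTF_16BE), 0));
    }
}
```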

    The UTF-8 encoding, defined in ISO 10646-1 Annex R and in RFC 2279, does not have these problems. It is the obvious way to use Unicode on Unix-style operating systems.

    UTF-8 has the following properties:

    • UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatible). This means that files containing only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
    • All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
    • The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD, and it indicates how many bytes the character occupies. All further bytes of the sequence are in the range 0x80 to 0xBF. This makes resynchronization easy, makes the encoding stateless, and makes it robust against missing bytes.
    • All 2^31 possible UCS codes can be encoded.
    • In theory a UTF-8 encoded character can be up to 6 bytes long, but 16-bit BMP characters are at most 3 bytes long.
    • The sort order of big-endian UCS-4 byte strings is preserved.
    • The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
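The resynchronization property in the list above follows from the fact that continuation bytes (0x80 to 0xBF) can always be told apart from first bytes. A minimal sketch of stepping back from an arbitrary byte offset to the start of the character it belongs to is shown below. The class and method names are hypothetical, and the code assumes the input is well-formed UTF-8.

```java
import java.nio.charset.StandardCharsets;

class Utf8ResyncSketch {
    // Continuation bytes have the bit pattern 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Move left from index until a byte that starts a character is found.
    static int startOfCharacter(byte[] utf8, int index) {
        while (index > 0 && isContinuation(utf8[index])) {
            index--;
        }
        return index;
    }

    public static void main(String[] args) {
        // 'a', then U+2260 (three bytes), then 'b'
        byte[] data = "a\u2260b".getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < data.length; i++) {
            System.out.printf("offset %d (0x%02X) belongs to the character starting at %d%n",
                    i, data[i] & 0xFF, startOfCharacter(data, i));
        }
    }
}
```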

    The following byte sequences are used to represent a character. Which sequence is used depends on the character's code number in Unicode.

    U-00000000 - U-0000007F: 0xxxxxxx
    U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
    U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
    U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

    The x bit positions are filled with the bits of the character's code number in binary representation; the rightmost x is the least significant bit. Only the shortest multibyte sequence that can represent the code number may be used. Note that in a multibyte sequence, the number of leading "1" bits in the first byte equals the number of bytes in the entire sequence.
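The table and the bit-filling rule above can be turned into a short encoder. The following Java sketch follows the six-row table literally, including the 5-byte and 6-byte forms of the original ISO 10646 / RFC 2279 definition; the later RFC 3629 limits UTF-8 to code points up to U+10FFFF and therefore to at most 4 bytes. The class and method names are hypothetical, and the sketch does not reject surrogate values, which a production encoder would.

```java
class Utf8TableSketch {
    // Encode one UCS code number (0..0x7FFFFFFF) using the shortest form in the table.
    static byte[] encode(int code) {
        if (code < 0) {
            throw new IllegalArgumentException("code must be in 0..0x7FFFFFFF");
        }
        if (code <= 0x7F) {
            return new byte[] { (byte) code }; // 0xxxxxxx
        }
        int length;
        if (code <= 0x7FF) length = 2;
        else if (code <= 0xFFFF) length = 3;
        else if (code <= 0x1FFFFF) length = 4;
        else if (code <= 0x3FFFFFF) length = 5;
        else length = 6;

        byte[] out = new byte[length];
        // Fill the continuation bytes from the right; each one is 10xxxxxx
        // and carries the next 6 low-order bits of the code number.
        for (int i = length - 1; i > 0; i--) {
            out[i] = (byte) (0x80 | (code & 0x3F));
            code >>>= 6;
        }
        // First byte: 'length' leading 1 bits, a 0 bit, then the remaining bits.
        int prefix = (0xFF << (8 - length)) & 0xFF;
        out[0] = (byte) (prefix | code);
        return out;
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("U+0041 -> " + hex(encode(0x41)));
        System.out.println("U+00A9 -> " + hex(encode(0xA9)));
        System.out.println("U+2260 -> " + hex(encode(0x2260)));
    }
}
```

For code points up to U+10FFFF (surrogates excluded) the result should agree with `String.getBytes(StandardCharsets.UTF_8)`.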

    For example, the Unicode character U+00A9 = 1010 1001 (the copyright sign) is encoded in UTF-8 as:

    11000010 10101001 = 0xC2 0xA9

    The character U+2260 = 0010 0010 0110 0000 (the not-equal-to sign) is encoded as:

    11100010 10001001 10100000 = 0xE2 0x89 0xA0
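The two worked examples can be reproduced with the JDK's own UTF-8 encoder. This is a small sketch; the class name and the `hexUtf8` helper are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Utf8ExamplesSketch {
    // UTF-8 bytes of a single code point, formatted like "0xC2 0xA9".
    static String hexUtf8(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("U+00A9 -> " + hexUtf8(0x00A9));
        System.out.println("U+2260 -> " + hexUtf8(0x2260));
    }
}
```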

    The official name and spelling of this encoding is UTF-8, where UTF stands for UCS Transformation Format. Do not write it any other way (such as utf8 or UTF_8) in documentation, unless you are referring to a variable name rather than to the encoding itself.

    What programming languages support Unicode?

    Most modern programming languages developed after about 1993 have a special data type for Unicode/ISO 10646-1 characters. It is called Wide_Character in Ada95 and char in Java.
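One practical consequence in Java: `char` is a 16-bit value, so a character outside the BMP occupies two `char` values (a surrogate pair) in a `String`. The sketch below, with hypothetical class and method names, contrasts `String.length()` (16-bit units) with `String.codePointCount` (characters).

```java
class JavaCharSketch {
    // Number of 16-bit char values in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u4E2D";                                         // one BMP character
        String supplementary = new String(Character.toChars(0x1F600)); // one character beyond the BMP

        System.out.println("bits in a char: " + Character.SIZE);
        System.out.println("BMP character: " + codeUnits(bmp) + " char, "
                + codePoints(bmp) + " code point");
        System.out.println("supplementary character: " + codeUnits(supplementary) + " chars, "
                + codePoints(supplementary) + " code point");
        System.out.println("first char is a high surrogate: "
                + Character.isHighSurrogate(supplementary.charAt(0)));
    }
}
```

Code that indexes strings by `char` therefore needs `codePointAt` and related methods to handle such characters correctly.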

    ISO C also specifies mechanisms for handling multibyte encodings and wide characters, and Amendment 1 to ISO C, published in September 1994, added more. These mechanisms were designed mainly for the various East Asian encodings and are considerably more elaborate than what is needed to handle UCS. UTF-8 is an example of what the ISO C standard calls a multibyte string encoding, and the wchar_t type can be used to store Unicode characters.

