Over the past two days I took the time to summarize and sort out how the various encodings actually work and how they are used in Java applications. I am recording the results here for future reference. To build a complete, in-depth understanding of text encoding, and to deal with the problems that come up during Java development (garbled text in particular), I think it is best to describe and analyze the topic as a series of three articles:

Article 1: Java character encoding series I: Unicode, GBK, GB2312, UTF-8 concept basics
Article 2: Java character encoding series II: Unicode, ISO-8859, GBK, UTF-8 encoding and mutual conversion
Article 3: Java character encoding series III: coding problems in Java applications

Article 1: Java character encoding series I: Unicode, GBK, GB2312, UTF-8 concept basics

This part reuses existing material: I am reposting an article that already covers it.
Source: holen's blog, "Character encoding and understanding Unicode, ISO 10646, UCS, UTF8, UTF16, GBK, GB2312"
Address: http://blog.donews.com/holen/archive/2004/11/30/188182.aspx
Unicode is the encoding scheme developed by unicode.org, intended to cover the scripts in common use all over the world.
In version 1.0 it was a 16-bit code, from U+0000 to U+FFFF, with each 2-byte code corresponding to one character. Starting with version 2.0 the 16-bit limit was abandoned: the original 16 bits became the basic plane and 16 more planes were added, so the code range is 0 to 0x10FFFF (17 planes, roughly a 21-bit code space).
UCS is the Universal Character Set defined by ISO in ISO 10646, which uses a 4-byte encoding.
GB 18030-2000 is based on GB 13000. It is an extension of GBK aligned with Unicode 3.0 and covers every Unicode code point.
Its position is equivalent to that of UTF-8 and UTF-16: it is an encoding form of Unicode. It is a variable-length encoding in which a character takes 1, 2, or 4 bytes.
GB 18030 is backward compatible with GB2312/GBK.
GB 18030 is mandatory for all non-handheld, non-embedded computer systems in China.
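The GB 18030 points above (variable length of 1, 2, or 4 bytes, and backward compatibility with GB2312/GBK) can be checked from Java. This is only a rough sketch: the class and method names are made up for illustration, and it assumes the runtime ships the optional GB2312, GBK, and GB18030 charsets, which a full JDK normally does but a trimmed-down runtime may not.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class Gb18030Demo {
    /** Encodes text with the named charset and returns the raw bytes. */
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    /** GB2312, GBK and GB18030 are optional charsets, so check before using them. */
    static boolean gbCharsetsAvailable() {
        return Charset.isSupported("GB2312")
                && Charset.isSupported("GBK")
                && Charset.isSupported("GB18030");
    }

    public static void main(String[] args) {
        if (!gbCharsetsAvailable()) {
            System.out.println("GB charsets are not available in this Java runtime");
            return;
        }
        String zhong = "\u4E2D";                                    // a common ideograph, present in GB2312
        String outsideBmp = new String(Character.toChars(0x20000)); // first CJK Extension B ideograph

        // Backward compatibility: the same bytes in all three encodings.
        System.out.println(Arrays.equals(encode(zhong, "GB2312"), encode(zhong, "GBK")));
        System.out.println(Arrays.equals(encode(zhong, "GBK"), encode(zhong, "GB18030")));

        // Variable length: an ASCII letter, a GBK ideograph, a character outside the BMP.
        System.out.println(encode("A", "GB18030").length);
        System.out.println(encode(zhong, "GB18030").length);
        System.out.println(encode(outsideBmp, "GB18030").length);
    }
}
```

If the charsets are present, the two comparisons should print true and the three lengths should come out as 1, 2, and 4.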
What are UCS and ISO 10646?
The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards. It guarantees round-trip compatibility with other character sets: if you convert any text string to UCS and then back to the original encoding, no information is lost.
UCS contains the characters needed to represent practically all known languages. This covers not only the Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese, Japanese, and Korean ideographs, as well as Hiragana, Katakana, Bengali, Gurmukhi (Punjabi), Tamil, Kannada, Malayalam, Thai, Lao, Bopomofo, Hangul, Devanagari, Gujarati, Oriya, Telugu, and many other scripts. For scripts that are not yet covered, research on how best to encode them for computer use is still under way, and they will be added eventually. These include Tibetan, Khmer, Runic (the ancient Nordic script), Ethiopic, other hieroglyphic scripts, and a variety of Indo-European languages, as well as selected artistic scripts such as Tengwar, Cirth, and Klingon. UCS also contains a large number of graphical, typographical, mathematical, and scientific symbols, including those provided by TeX, PostScript, MS-DOS, MS-Windows, Macintosh, OCR fonts, and many other word processing and publishing systems.
ISO 10646 defines a 31-bit character set. However, at the time this was written, only the first 65534 code positions of this huge code space (0x0000 to 0xFFFD) had been assigned. This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP).
Characters encoded outside the 16-bit BMP are special characters (such as hieroglyphs) that are used only by experts in fields such as history and science. According to current plans, no characters will ever be assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers a little over one million potential future characters.
The ISO 10646-1 standard, first published in 1993, defines the architecture of the character set and the content of the BMP. A second part, ISO 10646-2, which defines the characters encoded outside the BMP, is in preparation, but it may take several years to complete. New characters are still being added to the BMP, but the existing characters are stable and will not be changed.
Besides a code number, UCS also assigns each character an official name. A hexadecimal number that represents a UCS or Unicode value is usually prefixed with "U+", as in U+0041 for the character "Latin capital letter A". The UCS characters U+0000 to U+007F are identical to US-ASCII (ISO 646), and U+0000 to U+00FF are identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF, together with larger ranges outside the BMP, is reserved for private use.
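These naming and compatibility rules are easy to observe from Java. The following is a small sketch rather than anything from the original article: the class and helper names are invented for illustration, while `Character.getName`, `Character.getType`, and `StandardCharsets.ISO_8859_1` are standard JDK APIs (Java 7 or later).

```java
import java.nio.charset.StandardCharsets;

final class CodePointBasics {
    /** Formats a code point in the conventional "U+XXXX" notation. */
    static String uPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    /** Decodes a single byte as ISO 8859-1 and returns the resulting code point. */
    static int latin1ToCodePoint(int byteValue) {
        String s = new String(new byte[] {(byte) byteValue}, StandardCharsets.ISO_8859_1);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // Official character name for U+0041.
        System.out.println(uPlus('A') + " " + Character.getName('A'));

        // Every Latin-1 byte value decodes to the code point with the same number.
        boolean identical = true;
        for (int b = 0; b <= 0xFF; b++) {
            identical &= latin1ToCodePoint(b) == b;
        }
        System.out.println("Latin-1 matches U+0000..U+00FF: " + identical);

        // U+E000 lies in the private-use range of the BMP.
        System.out.println(Character.getType(0xE000) == Character.PRIVATE_USE);
    }
}
```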
What are combining characters?
Some code points in UCS are assigned to combining characters. They are similar to the non-spacing accent keys on a typewriter. A combining character is not a complete character by itself. It is an accent or other diacritical mark that is added to the preceding character, so an accent can be placed on any character.
The most important accented characters, such as those used in the orthographies of common languages, have code positions of their own in UCS to ensure backward compatibility with older character sets. Accented characters that have their own code position but can also be represented as a plain character followed by a combining character are called precomposed characters. UCS provides precomposed characters to stay backward compatible with older encodings, such as ISO 8859, that have no combining characters.
The combining mechanism lets you add accents or other diacritical marks to any character. This is particularly useful in scientific notation, such as mathematical formulas and the International Phonetic Alphabet, where any combination of a base character and one or more diacritical marks may be needed.
A combining character follows the character it modifies. For example, the German umlaut Ä ("Latin capital letter A with diaeresis") can be represented as the precomposed code U+00C4, or as a plain "Latin capital letter A" followed by a "combining diaeresis": U+0041 U+0308.
Several combining characters can be used when multiple accents have to be stacked, or when marks are needed both above and below a base character. In Thai script, for example, a base character can carry up to two combining characters.
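In Java the two representations of Ä are different strings, which is a common source of surprises when comparing text. The sketch below uses `java.text.Normalizer` to convert between them. Normalization itself is not discussed in the original article and is brought in here only to illustrate the precomposed and combining forms; the class and method names are invented for the example.

```java
import java.text.Normalizer;

final class CombiningDemo {
    static final String PRECOMPOSED = "\u00C4";  // one code point: A with diaeresis
    static final String COMBINED = "A\u0308";    // plain A followed by a combining diaeresis

    /** Converts to the composed form (NFC). */
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Converts to the decomposed form (NFD). */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // The raw strings differ even though they render the same.
        System.out.println(PRECOMPOSED.equals(COMBINED));            // false
        System.out.println(PRECOMPOSED.length() + " vs " + COMBINED.length()); // 1 vs 2

        // After normalizing to a common form they compare equal.
        System.out.println(compose(COMBINED).equals(PRECOMPOSED));   // true
        System.out.println(decompose(PRECOMPOSED).equals(COMBINED)); // true
    }
}
```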
What are UCS implementation levels?
Not all systems need to support every advanced mechanism of UCS, such as combining characters. ISO 10646 therefore specifies the following three implementation levels:
- Level 1: Combining characters and Hangul Jamo characters (a special, more complex encoding of Korean in which a syllable is encoded as two or three sub-characters) are not supported.
- Level 2: Like level 1, except that for some scripts a fixed list of combining characters is allowed (for example Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Telugu, Kannada, Malayalam, Thai, and Lao). Without at least these combining characters, UCS cannot fully represent these scripts.
- Level 3: All UCS characters are supported. For example, a mathematician can place a tilde (like the one above the Spanish letter ñ) or an arrow (or both) on any character.
What is Unicode?
Historically, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO). The other was the Unicode project, organized by a consortium of (mostly American) manufacturers of multilingual software.
Fortunately, around 1991 the participants of both projects realized that the world did not need two different unified character sets. They joined their efforts and worked together on a single code table. Both projects still exist and publish their respective standards independently, but the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of Unicode and ISO 10646 compatible and to coordinate any future extensions closely.
So what is the difference between Unicode and ISO 10646?
The Unicode Standard published by the Unicode Consortium contains exactly the Basic Multilingual Plane of ISO 10646-1 at implementation level 3. In both standards, all characters are at the same positions and have the same names.
The Unicode Standard additionally defines much more of the semantics associated with characters, and is in general the better reference for implementers of high-quality typographic publishing systems. Unicode specifies in detail the algorithms for rendering the presentation forms of certain scripts (such as Arabic), for handling bidirectional text (such as mixed Latin and Hebrew), for sorting and comparing strings, and much more.
The ISO 10646 standard, on the other hand, is little more than a simple character set table, like the well-known ISO 8859 standard. It specifies some terminology related to the standard, defines some encoding alternatives, and includes specifications of how to use UCS in connection with other ISO standards such as ISO 6429 and ISO 2022. There are also other closely related ISO standards. For example, ISO 14651 covers the sorting of UCS strings.
The Unicode Standard has a catchier name, is available from Addison-Wesley in any good bookstore for a small fraction of the price of the ISO version, and contains much more supplementary information, so it has become the far more widely used reference. However, it is generally agreed that the fonts used to print the ISO 10646-1 standard are in some respects of higher quality than those used to print Unicode 2.0. Professional font designers are advised to consult both standards, although some of the sample glyphs they provide differ significantly. The ISO 10646-1 standard also shows four different style variants for the Chinese, Japanese, and Korean (CJK) ideographs, whereas Unicode 2.0 shows only a Chinese variant in its tables. This led to the widespread legend that Unicode is unacceptable to Japanese users, which is incorrect.
What is UTF-8?
First of all, UCS and Unicode are just code tables that assign an integer to each character. There are several ways to represent a sequence of characters as a sequence of bytes. The two most obvious ones store Unicode text as a sequence of 2-byte or 4-byte units. The official names of these two encodings are UCS-2 and UCS-4, respectively. Unless otherwise specified, the most significant byte comes first (big-endian convention). To convert an ASCII or Latin-1 file to UCS-2, simply insert a 0x00 byte before each ASCII byte. To convert it to UCS-4, insert three 0x00 bytes before each ASCII byte.
Using UCS-2 (or UCS-4) under Unix would cause very serious problems. Strings in these encodings can contain bytes such as '\0' or '/' as part of many wide characters, and these bytes have special meanings in file names and other C library function parameters. In addition, most Unix tools expect ASCII files and cannot read 16-bit words as characters without major modification. For these reasons, UCS-2 is not a suitable external encoding of Unicode for file names, text files, environment variables, and similar places.
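The byte layout described above can be made visible from Java. This is a rough sketch with invented class and method names. It uses the JDK charsets UTF-16BE and UTF-32BE as stand-ins for big-endian UCS-2 and UCS-4, which holds for BMP characters such as the ones used here, and it assumes the runtime provides UTF-32BE (standard OpenJDK builds do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class FixedWidthDemo {
    /** Renders bytes as space-separated hex pairs, for example "00 41". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Big-endian 2-byte form (UTF-16BE, equal to UCS-2 for BMP characters). */
    static String ucs2(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_16BE));
    }

    /** Big-endian 4-byte form (UTF-32BE, used here as a stand-in for UCS-4). */
    static String ucs4(String s) {
        return hex(s.getBytes(Charset.forName("UTF-32BE")));
    }

    public static void main(String[] args) {
        System.out.println(ucs2("A"));        // 00 41
        System.out.println(ucs4("A"));        // 00 00 00 41

        // U+012F contains the byte 0x2F ('/') and U+0100 contains 0x00 (NUL),
        // which is why these encodings are unsafe in Unix file names.
        System.out.println(ucs2("\u012F"));   // 01 2F
        System.out.println(ucs2("\u0100"));   // 01 00
    }
}
```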
The UTF-8 encoding, defined in ISO 10646-1 Annex R and in RFC 2279, does not have these problems. It is clearly the way to use Unicode under Unix-style operating systems.
UTF-8 has the following properties:
- The UCS characters U+0000 to U+007F (ASCII) are encoded simply as the bytes 0x00 to 0x7F (ASCII compatible). This means that a file containing only 7-bit ASCII characters is identical in ASCII and in UTF-8.
- All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has its most significant bit set. Therefore an ASCII byte (0x00-0x7F) can never appear as part of any other character.
- The first byte of a multibyte sequence representing a non-ASCII character is always in the range 0xC0 to 0xFD and indicates how many bytes the character occupies. All remaining bytes of the sequence are in the range 0x80 to 0xBF. This makes resynchronization easy, makes the encoding stateless, and makes it robust against lost bytes.
- All 2^31 possible UCS codes can be encoded.
- In theory, a UTF-8 encoded character can be up to 6 bytes long, but 16-bit BMP characters are at most 3 bytes long.
- The sorting order of big-endian UCS-4 byte strings is preserved.
- The bytes 0xFE and 0xFF are never used in UTF-8.
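Several of the properties listed above can be spot-checked in Java on a few sample strings. The sketch below is illustrative only: the class, the helper names, and the sample strings are made up, and checking a handful of characters is of course not a proof of the general property.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Properties {
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** True if every byte has its most significant bit set (none is in 0x00-0x7F). */
    static boolean allHighBitsSet(byte[] bytes) {
        for (byte b : bytes) {
            if ((b & 0x80) == 0) {
                return false;
            }
        }
        return true;
    }

    /** True if the value 0xFE or 0xFF occurs anywhere in the array. */
    static boolean containsFeOrFf(byte[] bytes) {
        for (byte b : bytes) {
            int v = b & 0xFF;
            if (v == 0xFE || v == 0xFF) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String ascii = "plain ASCII text";
        String nonAscii = "\u00C4\u2260\u4E2D\u012F";  // four non-ASCII BMP characters

        // ASCII text has the same bytes in ASCII and in UTF-8.
        System.out.println(Arrays.equals(ascii.getBytes(StandardCharsets.US_ASCII), utf8(ascii)));

        // No byte of a non-ASCII character looks like an ASCII byte such as '/' or NUL.
        System.out.println(allHighBitsSet(utf8(nonAscii)));

        // 0xFE and 0xFF never occur.
        System.out.println(containsFeOrFf(utf8(nonAscii)));

        // BMP characters need at most three bytes.
        System.out.println(utf8("\u00A9").length + " " + utf8("\u2260").length);
    }
}
```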
The following byte sequences are used to represent a character. Which sequence is used depends on the code number of the character in Unicode:
U-00000000 - U-0000007F:  0xxxxxxx
U-00000080 - U-000007FF:  110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The x bit positions are filled with the bits of the character's code number in binary. The rightmost x is the least significant bit. Only the shortest multibyte sequence that can represent the code number may be used. Note that in a multibyte sequence, the number of leading 1 bits in the first byte equals the number of bytes in the whole sequence.
For example, the Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as:
11000010 10101001 = 0xC2 0xA9
The character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:
11100010 10001001 10100000 = 0xE2 0x89 0xA0
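The table and the two worked examples can be turned into a few lines of Java. The encoder below is a minimal sketch written for this article, not the JDK's implementation, and its names are invented. It follows the original six-row table literally, whereas Java's built-in UTF-8 encoder (like RFC 3629, which replaced RFC 2279) stops at four bytes and U+10FFFF, so the two are compared only for code points in that range.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Encoder {
    /** Encodes one code number (0 to 0x7FFFFFFF) according to the six-row table. */
    static byte[] encode(int cp) {
        if (cp < 0) {
            throw new IllegalArgumentException("code number must not be negative");
        }
        if (cp < 0x80) {
            return new byte[] {(byte) cp};              // 0xxxxxxx
        }
        // Shortest sequence length that can hold the code number.
        int n = cp < 0x800 ? 2 : cp < 0x10000 ? 3 : cp < 0x200000 ? 4 : cp < 0x4000000 ? 5 : 6;
        byte[] out = new byte[n];
        // Fill continuation bytes from the right: 10xxxxxx, six payload bits each.
        for (int i = n - 1; i > 0; i--) {
            out[i] = (byte) (0x80 | (cp & 0x3F));
            cp >>>= 6;
        }
        // First byte: n leading 1 bits, one 0 bit, then the remaining payload bits.
        int marker = (0xFF << (8 - n)) & 0xFF;
        out[0] = (byte) (marker | cp);
        return out;
    }

    /** Renders bytes as hex, for example "C2 A9". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xA9, 0x2260, 0x1F600};
        for (int cp : samples) {
            byte[] manual = encode(cp);
            byte[] builtIn = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X -> %s (matches JDK: %b)%n",
                    cp, hex(manual), Arrays.equals(manual, builtIn));
        }
    }
}
```

For U+00A9 and U+2260 this reproduces the byte values 0xC2 0xA9 and 0xE2 0x89 0xA0 from the examples above.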
The official name and spelling of this encoding is UTF-8, where UTF stands for UCS Transformation Format. Do not write UTF-8 in any other way (such as utf8 or UTF_8) in documentation, unless you are referring to a variable name rather than the encoding itself.
What programming languages support Unicode?
Most modern programming languages developed after around 1993 have a special data type for Unicode/ISO 10646-1 characters. In Ada95 it is called Wide_Character, and in Java it is called char.
ISO C also specifies mechanisms for handling multibyte encodings and wide characters, and more were added with Amendment 1 to ISO C, published in September 1994. These mechanisms were designed mainly with the various East Asian encodings in mind, and they are considerably more sophisticated than what is needed to handle UCS. UTF-8 is an example of what the ISO C standard calls a multibyte encoding of strings, and the wchar_t type can be used to hold Unicode characters.
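One practical consequence for Java is worth illustrating. A Java char is 16 bits wide, so it covers the BMP directly, while a character outside the BMP occupies two char values (a surrogate pair). The sketch below uses only standard `String` and `Character` methods; the class and helper names are invented for illustration.

```java
final class JavaCharDemo {
    /** Number of 16-bit char units in the string. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u4E2D";                                       // a BMP character
        String outsideBmp = new String(Character.toChars(0x1D11E));  // MUSICAL SYMBOL G CLEF

        System.out.println(Character.SIZE);                          // 16 bits per char

        // One char per BMP character, two chars for a character outside the BMP.
        System.out.println(charCount(bmp) + " " + codePointCount(bmp));               // 1 1
        System.out.println(charCount(outsideBmp) + " " + codePointCount(outsideBmp)); // 2 1

        // The two chars form a surrogate pair that maps back to one code point.
        System.out.println(Character.isHighSurrogate(outsideBmp.charAt(0)));
        System.out.println(Integer.toHexString(outsideBmp.codePointAt(0)));           // 1d11e
    }
}
```

Code that iterates over text character by character should therefore use the code point methods rather than `charAt` when characters outside the BMP may appear.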