Every region of the world has its own languages, and regional differences lead directly to different language environments. When developing an internationalized program, handling language issues correctly is essential.
This is a worldwide problem, so Java provides a worldwide solution. The approach described in this article is applied to Chinese, but by extension it works equally well for the languages of other countries and regions.
Chinese characters are double-byte characters. "Double-byte" means that each character occupies two bytes (16 bits), called the high byte and the low byte. China's mandatory national encoding standard is GB2312, and almost every application that can handle Chinese supports it. GB2312 contains the level-1 and level-2 Chinese characters plus 9 zones of symbols. The high byte ranges from 0xA1 to 0xFE and the low byte from 0xA1 to 0xFE; the Chinese characters themselves occupy the range 0xB0A1 to 0xF7FE.
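The byte ranges above can be turned into a simple range check. The following is a minimal sketch, not a validator for the full standard: the class and method names are illustrative, and it only tests whether a byte pair falls inside the ranges quoted above, not whether that code point is actually assigned.

```java
final class Gb2312Ranges {
    /** True if both bytes fall in the GB2312 double-byte range 0xA1..0xFE. */
    static boolean isGb2312Pair(int high, int low) {
        return high >= 0xA1 && high <= 0xFE && low >= 0xA1 && low <= 0xFE;
    }

    /** True if the pair lies in the Chinese character area 0xB0A1..0xF7FE. */
    static boolean isGb2312Hanzi(int high, int low) {
        return high >= 0xB0 && high <= 0xF7 && low >= 0xA1 && low <= 0xFE;
    }

    public static void main(String[] args) {
        // 0xB0A1 is the first code in the Chinese character area.
        System.out.println(isGb2312Hanzi(0xB0, 0xA1));
        // 0xA1A1 is a valid GB2312 pair, but it lies in the symbol zones.
        System.out.println(isGb2312Pair(0xA1, 0xA1) + " " + isGb2312Hanzi(0xA1, 0xA1));
    }
}
```

When the bytes come from a Java byte array, mask them with `& 0xFF` first, because Java bytes are signed.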
There is also an encoding called GBK, but it is only a specification, not a mandatory standard. GBK provides 20,902 Chinese characters, is compatible with GB2312, and has an encoding range of 0x8140 to 0xFEFE. Every character in GBK can be mapped to Unicode 2.0.
In the near future, China will issue another standard: GB18030-2000 (GBK2K). It adds the scripts of Tibetan, Mongolian and other minority languages, which fundamentally solves the shortage of characters. Note that it is no longer a fixed-length encoding. Its two-byte part is compatible with GBK, and its four-byte part holds the extended characters and glyphs. In a four-byte code, the first and third bytes range from 0x81 to 0xFE, and the second and fourth bytes from 0x30 to 0x39.
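Because the encoding is variable-length, a reader has to look at the leading bytes to know how many bytes a character occupies. The sketch below illustrates that idea using the four-byte ranges given above. Two details are assumptions that the text above does not state: single bytes 0x00 to 0x7F are treated as one-byte codes, and the two-byte form is assumed to use a GBK-style trail byte (0x40 to 0xFE, excluding 0x7F). The class and method names are hypothetical.

```java
final class Gb18030Length {
    /**
     * Returns the length (1, 2 or 4) of the sequence starting at offset,
     * or -1 if the bytes do not match any of the patterns checked here.
     * The caller must pass an offset inside the array.
     */
    static int sequenceLength(byte[] buf, int offset) {
        int b1 = buf[offset] & 0xFF;
        if (b1 <= 0x7F) {
            return 1; // assumption: single-byte range
        }
        if (b1 < 0x81 || b1 > 0xFE || offset + 1 >= buf.length) {
            return -1;
        }
        int b2 = buf[offset + 1] & 0xFF;
        if (b2 >= 0x30 && b2 <= 0x39) {
            // Four-byte form: bytes 1 and 3 in 0x81..0xFE, bytes 2 and 4 in 0x30..0x39.
            if (offset + 3 >= buf.length) {
                return -1;
            }
            int b3 = buf[offset + 2] & 0xFF;
            int b4 = buf[offset + 3] & 0xFF;
            boolean valid = b3 >= 0x81 && b3 <= 0xFE && b4 >= 0x30 && b4 <= 0x39;
            return valid ? 4 : -1;
        }
        // Two-byte form: assumed GBK-compatible trail byte.
        boolean trailOk = b2 >= 0x40 && b2 <= 0xFE && b2 != 0x7F;
        return trailOk ? 2 : -1;
    }

    public static void main(String[] args) {
        byte[] twoByte = {(byte) 0xB0, (byte) 0xA1};
        byte[] fourByte = {(byte) 0x81, 0x30, (byte) 0x81, 0x30};
        System.out.println(sequenceLength(twoByte, 0));
        System.out.println(sequenceLength(fourByte, 0));
    }
}
```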
This article does not cover Unicode itself; interested readers can visit "http://www.unicode.org/" for more information. Unicode has one key property: it aims to include the characters of every language in the world. The character set of each region can therefore be mapped to Unicode, and Java uses exactly this to convert between different encodings.
In the JDK, the Chinese-related encodings are:
Table 1. Chinese-related encodings in the JDK
Encoding Name | Description
ASCII | 7-bit, same as ascii7
ISO8859-1 | 8-bit, same as 8859_1, iso-8859-1, iso_8859-1, latin1 ...
GB2312-80 | 16-bit, same as gb2312, gb2312-1980, euc_cn, euccn, 1381, cp1381, 1383, Cp1383, ISO2022CN, ISO2022CN_GB ...
GBK | Same as MS936. Note: case sensitive
UTF8 | Same as UTF-8
GB18030 | Same as cp1392 and 1392; few JDK versions currently support it
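To see which of these names the JDK you are running actually accepts, and which canonical name each one resolves to, a small check along the following lines may help. It is only a sketch: the class name CharsetAliases and the list of names probed are illustrative choices, and the output depends on the JDK version and the charsets installed with it.

```java
import java.nio.charset.Charset;

final class CharsetAliases {
    /** Returns the canonical name reported by the running JDK, or null if the name is not supported. */
    static String canonicalName(String name) {
        return Charset.isSupported(name) ? Charset.forName(name).name() : null;
    }

    public static void main(String[] args) {
        String[] names = {
            "ASCII", "ISO8859-1", "latin1", "GB2312", "euc_cn",
            "GBK", "MS936", "UTF8", "GB18030"
        };
        for (String name : names) {
            String canonical = canonicalName(name);
            System.out.println(name + " -> " + (canonical == null ? "(not supported)" : canonical));
        }
    }
}
```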
In practice, the encodings you will meet most often are GB2312 (GBK) and ISO8859-1.
Why does "?" appear
As mentioned above, conversion between different languages goes through Unicode. Given two different languages A and B, the conversion takes two steps: first convert A to Unicode, then convert Unicode to B.
For example, the Chinese character "Li" (李) is encoded in GB2312 as "C0EE", and we want to convert it to ISO8859-1. The steps are: first convert "Li" to Unicode, which gives "674E"; then convert "674E" to an ISO8859-1 character. Of course, this mapping cannot succeed, because ISO8859-1 has no character that corresponds to "674E".
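The two steps can be sketched in Java, where a String plays the role of the Unicode intermediate form. The class and method names here are illustrative, and the example assumes the running JDK ships a GB2312 charset, which it checks before use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ViaUnicode {
    /** Step 1: source bytes to Unicode (a Java String). */
    static String toUnicode(byte[] source, Charset from) {
        return new String(source, from);
    }

    /** Step 2: Unicode to target bytes; String.getBytes substitutes the charset's replacement byte for unmappable characters. */
    static byte[] fromUnicode(String text, Charset to) {
        return text.getBytes(to);
    }

    public static void main(String[] args) {
        if (!Charset.isSupported("GB2312")) {
            System.out.println("GB2312 is not available in this JDK");
            return;
        }
        byte[] gb2312Bytes = {(byte) 0xC0, (byte) 0xEE};
        String unicode = toUnicode(gb2312Bytes, Charset.forName("GB2312"));
        // Print the intermediate Unicode code of the first character.
        System.out.printf("U+%04X%n", (int) unicode.charAt(0));
        byte[] latin1 = fromUnicode(unicode, StandardCharsets.ISO_8859_1);
        // Print the single byte produced for the unmappable character.
        System.out.printf("0x%02x%n", latin1[0] & 0xFF);
    }
}
```

Keeping the two steps as separate methods makes it easy to inspect the intermediate String when a conversion goes wrong.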
The problem arises when a mapping fails. When converting from a language to Unicode, if the language has no such character, the result is the Unicode code "\ufffd" ("\u" denotes a Unicode code). When converting from Unicode to a language, if the language has no corresponding character, the result is "0x3f" ("?"). This is where the "?" comes from.
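Both failure markers can be checked for in code rather than discovered on screen. The following is a small sketch with hypothetical helper names; it uses US-ASCII only because that charset is guaranteed to exist in every JDK. Note that the decode check cannot tell a substituted "\ufffd" from one that was genuinely present in the source data.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReplacementMarkers {
    static final char REPLACEMENT = '\uFFFD';

    /** True if the decoded text contains U+FFFD, the marker left by a failed byte-to-Unicode mapping. */
    static boolean lostOnDecode(String decoded) {
        return decoded.indexOf(REPLACEMENT) >= 0;
    }

    /** True if at least one character has no mapping in the target charset and would be replaced on encoding. */
    static boolean wouldLoseOnEncode(String text, Charset target) {
        return !target.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // 0x80 is not a valid US-ASCII byte, so decoding substitutes U+FFFD.
        String decoded = new String(new byte[] {0x41, (byte) 0x80}, StandardCharsets.US_ASCII);
        System.out.println(lostOnDecode(decoded));

        // U+674E has no US-ASCII mapping, so encoding substitutes 0x3f.
        byte[] encoded = "A\u674e".getBytes(StandardCharsets.US_ASCII);
        System.out.printf("%02x %02x%n", encoded[0], encoded[1]);
        System.out.println(wouldLoseOnEncode("A\u674e", StandardCharsets.US_ASCII));
    }
}
```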
For example, applying new String(buf, "gb2312") to the byte stream buf = "0x80 0x40 0xb0 0xa1" gives "\ufffd\u554a". Printing it with println shows "?啊", because "0x80 0x40" is a character in GBK but not in GB2312.
As another example, take the string string = "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9" and apply new String(string.getBytes("GBK")). The encoded bytes are "3fa8aca8a6463fa8b4": "\u00d6" has no corresponding character in GBK, so it yields "3f"; "\u00ec" maps to "a8ac"; "\u00e9" maps to "a8a6"; "\u0046" maps to "46" (it is an ASCII character); "\u00bb" is not found and yields "3f"; finally, "\u00f9" maps to "a8b4". Printing the resulting string with println gives "?ìéF?ù". Notice that the output is not all question marks: the mapping between GBK and Unicode covers characters other than Chinese characters, and this example proves it.
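To reproduce the byte sequence quoted above, the encoded bytes can be dumped as hexadecimal. This is a minimal sketch with illustrative names; it assumes the running JDK provides a GBK charset and returns null when it does not. The exact bytes depend on the JDK's GBK mapping table, so compare the output against the sequence given in the text rather than taking it for granted.

```java
import java.nio.charset.Charset;

final class GbkHexDump {
    /** Formats each byte as two lowercase hex digits. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes text with the named charset and returns the bytes as hex, or null if the charset is unavailable. */
    static String encodeToHex(String text, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return null;
        }
        return toHex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String text = "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9";
        String hex = encodeToHex(text, "GBK");
        System.out.println(hex == null ? "GBK is not available in this JDK" : hex);
    }
}
```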
So when Chinese character transcoding goes wrong, the output is not necessarily all question marks. Still, wrong is wrong: a partly garbled result is no different in kind from a fully garbled one.
You may ask: what happens if a character exists in the source character set but not in Unicode? The answer is that I do not know, because I have no such source character set at hand to test with. One thing is certain, though: such a source character set does not conform to the standard. In Java, if this happens, an exception is thrown.