Java character encoding, Series II: Unicode, ISO-8859-1, GBK, UTF-8 encoding and mutual conversion

Source: Internet
Author: User
1. Function introduction

In Java, a String holds Unicode text as a sequence of 16-bit char values (UTF-16 code units), so each char occupies two bytes. The two main methods related to encoding are:

1) Encode the string into a byte array using the specified charset (Unicode -> charsetName conversion):

    public byte[] getBytes(String charsetName) throws UnsupportedEncodingException

2) Construct a String from a byte array using the specified charset (charsetName -> Unicode conversion):

    public String(byte[] bytes, String charsetName) throws UnsupportedEncodingException

2. Direct conversion between Unicode and each encoding

The following uses the string "a中文" (\u0061\u4E2D\u6587) as an example to show how the conversion works for each encoding.

1) Unicode and GBK

The test results are as follows. The conversion is reversible, that is, the bytes can be converted back to the original string.

String -GBK-> ByteArray: \u0061\u4E2D\u6587 (a中文) -> 0x61 0xD6 0xD0 0xCE 0xC4
ByteArray -GBK-> String: 0x61 0xD6 0xD0 0xCE 0xC4 -> \u0061\u4E2D\u6587 (a中文)

2) Unicode and UTF-8

The test results are as follows. Each Chinese character is converted to three bytes, and the conversion is reversible, that is, the bytes can be converted back to the original string.

String -UTF-8-> ByteArray: \u0061\u4E2D\u6587 (a中文) -> 0x61 0xE4 0xB8 0xAD 0xE6 0x96 0x87
ByteArray -UTF-8-> String: 0x61 0xE4 0xB8 0xAD 0xE6 0x96 0x87 -> \u0061\u4E2D\u6587 (a中文)

3) Unicode and ISO-8859-1

The test results are as follows. The conversion fails and is not reversible, that is, the bytes can no longer be converted back to the original string.

String -ISO-8859-1-> ByteArray: \u0061\u4E2D\u6587 (a中文) -> 0x61 0x3F 0x3F
ByteArray -ISO-8859-1-> String: 0x61 0x3F 0x3F -> \u0061\u003F\u003F (a??)
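The direct conversions above can be reproduced with a short program. This is a minimal sketch, not the author's original test code: the class name DirectConversionDemo and the helpers encode, decode, and toHex are hypothetical names introduced here, and the sketch assumes the JDK in use provides the GBK charset (standard JDK distributions normally do).

```java
import java.io.UnsupportedEncodingException;

class DirectConversionDemo {

    // Encodes text with the named charset (Unicode -> charsetName).
    static byte[] encode(String text, String charsetName) {
        try {
            return text.getBytes(charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    // Decodes bytes with the named charset (charsetName -> Unicode).
    static String decode(byte[] bytes, String charsetName) {
        try {
            return new String(bytes, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    // Formats bytes as "0x61 0xD6 ..." to match the notation used in the text.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The sample string: 'a' followed by U+4E2D and U+6587.
        String text = "a\u4E2D\u6587";
        for (String cs : new String[] {"GBK", "UTF-8", "ISO-8859-1"}) {
            byte[] bytes = encode(text, cs);
            String back = decode(bytes, cs);
            System.out.println(cs + ": " + toHex(bytes)
                    + " reversible=" + text.equals(back));
        }
    }
}
```

The helpers wrap the checked UnsupportedEncodingException only to keep the calling code short; calling getBytes(String) and the String(byte[], String) constructor directly behaves the same way.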
3. Cross conversion between Unicode and each encoding

In the direct conversions above, the byte array generated from the string (Unicode) was decoded with the correct charset when the string was constructed again. What if the wrong charset is used? Will the string be constructed correctly? If not, can it be recovered, or is information lost? This section looks at these cases. It shows that in some cases the final result is displayed correctly even though the conversions in between were incorrect.

1) Incorrect intermediate conversion that still displays correctly

We know that String -GBK-> ByteArray -GBK-> String is correct. But what if we use String -GBK-> ByteArray -ISO-8859-1-> String? Test result:

String -GBK-> ByteArray -ISO-8859-1-> String:
\u0061\u4E2D\u6587 (a中文) -> 0x61 0xD6 0xD0 0xCE 0xC4 -> \u0061\u00D6\u00D0\u00CE\u00C4 (aÖÐÎÄ)

The string we get is garbled ("aÖÐÎÄ", which may be displayed as "a????"), but by continuing the conversion we can still restore the correct string "a中文". The process is as follows:

String -GBK-> ByteArray -ISO-8859-1-> String -ISO-8859-1-> ByteArray -GBK-> String

Correspondence:
\u0061\u4E2D\u6587 (a中文) -> 0x61 0xD6 0xD0 0xCE 0xC4 -> \u0061\u00D6\u00D0\u00CE\u00C4 (aÖÐÎÄ) -> 0x61 0xD6 0xD0 0xCE 0xC4 -> \u0061\u4E2D\u6587 (a中文)

That is, when the string is first constructed, the wrong charset is used and the result is garbled. We then pile a second error on the first: we use the same wrong charset to get the byte array back, and construct the string from it with the correct charset. The correct string is restored. This is the "incorrect intermediate conversion that still displays correctly". It often happens when processing data submitted from JSP pages.
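The "error on error" recovery described above can be sketched as follows. This is an illustration under stated assumptions: the class CrossConversionDemo and the methods misdecode and repair are hypothetical names, and the sketch uses the Charset-based overloads getBytes(Charset) and String(byte[], Charset) rather than the charset-name overloads quoted in the article, only to avoid the checked exception.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CrossConversionDemo {

    // Assumes the running JDK provides GBK; forName throws otherwise.
    static final Charset GBK = Charset.forName("GBK");

    // Encodes with the actual charset but decodes with the wrong one,
    // which produces a garbled string.
    static String misdecode(String text, Charset actual, Charset wrong) {
        return new String(text.getBytes(actual), wrong);
    }

    // Undoes misdecode: re-encode with the same wrong charset to get the
    // original bytes back, then decode them with the actual charset.
    static String repair(String garbled, Charset wrong, Charset actual) {
        return new String(garbled.getBytes(wrong), actual);
    }

    public static void main(String[] args) {
        // The sample string: 'a' followed by U+4E2D and U+6587.
        String text = "a\u4E2D\u6587";

        String garbled = misdecode(text, GBK, StandardCharsets.ISO_8859_1);
        String restored = repair(garbled, StandardCharsets.ISO_8859_1, GBK);

        System.out.println("garbled length: " + garbled.length());
        System.out.println("restored equals original: " + text.equals(restored));
    }
}
```

The recovery works with ISO-8859-1 as the wrong charset because that charset maps every byte value 0x00 to 0xFF to a character and back, so no byte is lost in the intermediate string.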
In addition, two other incorrect intermediate conversions still end with the correct string:

String -UTF-8-> ByteArray -ISO-8859-1-> String -ISO-8859-1-> ByteArray -UTF-8-> String

and

String -UTF-8-> ByteArray -GBK-> String -GBK-> ByteArray -UTF-8-> String

For the second one, the correspondence is:
\u0061\u4E2D\u6587 (a中文) -> 0x61 0xE4 0xB8 0xAD 0xE6 0x96 0x87 -> \u0061\u6D93\uE15F\u6783 (a涓枃, with the private-use character \uE15F between 涓 and 枃) -> 0x61 0xE4 0xB8 0xAD 0xE6 0x96 0x87 -> \u0061\u4E2D\u6587 (a中文)

4. Error diagnosis reference for encoding problems

1) One Chinese character appears as one question mark

The byte array was obtained from the string with ISO-8859-1. Each char is converted to a single byte, and any char that ISO-8859-1 cannot represent is converted to 0x3F ("?"). No matter which charset is then used to construct the string, the result is "?" and the original character cannot be recovered.

2) One Chinese character appears as two question marks (or two garbled characters)

The byte array was obtained from the string with GBK, which converts each Chinese character to two bytes, and the string was then constructed with ISO-8859-1 or UTF-8. If it was constructed with ISO-8859-1, it can be restored with the "error on error" approach described above (get the bytes with ISO-8859-1 and construct the string with GBK). If it was constructed with UTF-8, the invalid bytes become the Unicode replacement character \uFFFD and the string cannot be restored. If it then goes through String -UTF-8-> ByteArray -GBK-> String again, more complex garbled text appears, such as a锟斤拷锟斤拷.

3) One Chinese character appears as three question marks (or three garbled characters)

The byte array was obtained from the string with UTF-8, which converts each Chinese character to three bytes. If the string is then constructed with ISO-8859-1, three garbled characters appear per Chinese character. If it is constructed with GBK, complex garbled text appears, such as a涓枃.
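The diagnosis cases can be checked with a small program that tries each wrong pairing and then attempts the "error on error" repair. This is a rough sketch: EncodingDiagnosisDemo, throughLatin1, misdecode, and recoverable are hypothetical names, and the GBK results depend on the GBK mapping tables of the JDK in use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDiagnosisDemo {

    // Assumes the running JDK provides GBK; forName throws otherwise.
    static final Charset GBK = Charset.forName("GBK");

    // Case 1: encoding with ISO-8859-1 turns every char it cannot
    // represent into the single byte 0x3F ('?'), so data is lost.
    static String throughLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Cases 2 and 3: encode with one charset, decode with another.
    static String misdecode(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    // Applies the "error on error" repair to the garbled string and
    // reports whether the original text came back.
    static boolean recoverable(String original, Charset encodeWith, Charset decodeWith) {
        String garbled = misdecode(original, encodeWith, decodeWith);
        String repaired = new String(garbled.getBytes(decodeWith), encodeWith);
        return original.equals(repaired);
    }

    public static void main(String[] args) {
        // The sample string: 'a' followed by U+4E2D and U+6587.
        String text = "a\u4E2D\u6587";
        Charset latin1 = StandardCharsets.ISO_8859_1;
        Charset utf8 = StandardCharsets.UTF_8;

        System.out.println("ISO-8859-1 encode: " + throughLatin1(text));
        System.out.println("GBK bytes as ISO-8859-1, recoverable: "
                + recoverable(text, GBK, latin1));
        System.out.println("GBK bytes as UTF-8, recoverable: "
                + recoverable(text, GBK, utf8));
        System.out.println("UTF-8 bytes as ISO-8859-1, recoverable: "
                + recoverable(text, utf8, latin1));
        System.out.println("UTF-8 bytes as GBK, recoverable: "
                + recoverable(text, utf8, GBK));
    }
}
```

A pairing is recoverable only when the wrong charset can represent every byte of the original encoding without loss. That always holds for ISO-8859-1 and does not hold for UTF-8 decoding of GBK bytes, where invalid sequences collapse into \uFFFD.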