In the previous post, I walked through each channel through which Java transcodes text and the steps of program execution at which transcoding takes place. A problem at any one of those steps is likely to produce garbled text. This post describes how Java encodes and decodes during transcoding.
Encoding & decoding
The previous post described the three channels of encoding conversion. This post summarizes the situations in which Java needs to encode and decode, and walks through the intermediate steps in detail, to give a firmer grasp of the Java encoding and decoding process. There are four main scenarios in Java that require encoding and decoding:
1: I/O operations
2: Memory
3: Database
4: Java Web
This post mainly covers the first two scenarios. The database scenario causes no problems as long as the correct encoding is configured. The Java Web scenario involves too much to cover here (URL encoding, GET and POST encoding, servlet decoding), so I will introduce it in the next post.
I/O operations
As I said earlier, garbled text comes down to inconsistent encodings during transcoding, for example encoding with UTF-8 and decoding with GBK. The most fundamental cause, however, is a faulty conversion from characters to bytes or from bytes to characters, and the most important place where this conversion happens is I/O. I/O mainly comprises network I/O (that is, Java Web) and disk I/O. Network I/O is covered in the next post.
First, let's look at encoding in I/O operations.
InputStream is the superclass of all classes that represent an input stream of bytes, and Reader is the abstract class for reading character streams. Java reads files in two ways, by byte stream and by character stream, and InputStream and Reader are the superclasses of these two approaches.
By byte
We generally use InputStream.read() to read bytes from the data stream (read() reads only one byte at a time and is very slow, so we usually use read(byte[])), store them in a byte[] array, and finally convert them to a String. When we read a file, the encoding of the bytes read depends on the encoding the file was saved in, and converting to a String also involves an encoding; if the two differ, problems can occur. For example, suppose the file Test.txt is encoded in UTF-8. The bytes read from it through a byte stream are then UTF-8 bytes. If we do not specify an encoding when converting them to a String, the system default encoding (GBK here) is used for decoding. Because the two encodings are inconsistent, constructing the String produces garbled characters, as follows:
File file = new File("C:\\Test.txt");
InputStream input = new FileInputStream(file);
StringBuffer buffer = new StringBuffer();
byte[] bytes = new byte[1024];
for (int n; (n = input.read(bytes)) != -1; ) {
    // No charset given: the bytes are decoded with the platform default
    buffer.append(new String(bytes, 0, n));
}
System.out.println(buffer);
Output result: 鎴戞槸 cm
The content of Test.txt is: 我是 cm ("I am cm").
To avoid garbled characters, specify the encoding when constructing the String so that encoding and decoding use the same format:
buffer.append(new String(bytes, 0, n, "UTF-8"));
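The effect of a matching versus a mismatched charset can be reproduced without a file on disk. The following is a minimal sketch under some assumptions of mine: the class name ByteReadDemo and the helper readAll are hypothetical, a ByteArrayInputStream stands in for the UTF-8 file, and ISO-8859-1 stands in for the mismatched charset (GBK in the text above) because it is guaranteed to exist on every Java platform.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteReadDemo {
    // Hypothetical helper: collects every byte first, then decodes once
    // with the given charset.
    static String readAll(InputStream in, Charset cs) {
        try {
            ByteArrayOutputStream collected = new ByteArrayOutputStream();
            byte[] chunk = new byte[4]; // deliberately tiny to force several reads
            for (int n; (n = in.read(chunk)) != -1; ) {
                collected.write(chunk, 0, n);
            }
            return new String(collected.toByteArray(), cs);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u6211\u662f cm"; // the sample string 我是 cm
        byte[] fileBytes = text.getBytes(StandardCharsets.UTF_8); // stands in for the file

        String matching = readAll(new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8);
        String mismatched = readAll(new ByteArrayInputStream(fileBytes), StandardCharsets.ISO_8859_1);

        System.out.println(matching.equals(text));   // true
        System.out.println(mismatched.equals(text)); // false: decoded with the wrong charset
    }
}
```

Decoding only after all bytes are collected is a deliberate choice in this sketch: decoding each chunk separately can split a multi-byte character across two reads and corrupt it even when the charset is correct.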
By character
A character stream can be seen as a wrapper stream: underneath, it still uses a byte stream to read bytes, and then decodes those bytes into characters using the specified encoding. In Java, Reader is the superclass for reading character streams. So at the lowest level, reading a file by byte and reading it by character are no different. What differs is the unit of each read: a character stream returns one character at a time, while a byte stream returns one byte at a time.
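The difference in reading unit can be made visible by counting how many read() calls each kind of stream needs for the same data. This is only an illustrative sketch: ReadUnitDemo, countBytes, and countChars are names I made up, and the data is an in-memory UTF-8 byte array rather than a file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReadUnitDemo {
    // Counts single-byte reads from a plain byte stream.
    static int countBytes(byte[] data) {
        InputStream in = new ByteArrayInputStream(data);
        int count = 0;
        try {
            while (in.read() != -1) {
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    // Counts single-character reads from a Reader wrapped around the same bytes.
    static int countChars(byte[] data, Charset cs) {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(data), cs);
        int count = 0;
        try {
            while (reader.read() != -1) {
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = "\u6211\u662f cm".getBytes(StandardCharsets.UTF_8); // 我是 cm
        System.out.println(countBytes(data));                          // 9
        System.out.println(countChars(data, StandardCharsets.UTF_8));  // 5
    }
}
```

The two Chinese characters take three bytes each in UTF-8 and the three ASCII characters take one byte each, which is why the byte stream needs nine reads and the character stream only five.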
Byte & character conversions
Converting bytes to characters requires InputStreamReader. The API documentation explains it as follows. An InputStreamReader is a bridge from byte streams to character streams: it reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or given explicitly, or the platform's default charset may be accepted. Each invocation of one of an InputStreamReader's read() methods may cause one or more bytes to be read from the underlying byte-input stream. To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation. The API explanation is very clear: InputStreamReader still reads bytes underneath, and after reading them it needs a specified encoding to decode them into characters; if no encoding is specified, the system default encoding is used.
String file = "C:\\Test.txt";
String charset = "UTF-8";

// Write: characters are encoded into a byte stream
FileOutputStream outputStream = new FileOutputStream(file);
OutputStreamWriter writer = new OutputStreamWriter(outputStream, charset);
try {
    writer.write("我是 cm");
} finally {
    writer.close();
}

// Read: bytes are decoded into characters
FileInputStream inputStream = new FileInputStream(file);
InputStreamReader reader = new InputStreamReader(inputStream, charset);
StringBuffer buffer = new StringBuffer();
char[] buf = new char[64];
int count = 0;
try {
    while ((count = reader.read(buf)) != -1) {
        buffer.append(buf, 0, count);
    }
} finally {
    reader.close();
}
System.out.println(buffer);
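The same write-then-read round trip can also be written with try-with-resources, which closes the streams automatically. This is a sketch, not the author's code: RoundTripDemo and roundTrip are hypothetical names, and a temporary file created with Files.createTempFile replaces the fixed path C:\Test.txt so the example runs on any platform.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class RoundTripDemo {
    // Writes text to a temporary file with the given charset and reads it
    // back with the same charset.
    static String roundTrip(String text, Charset cs) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                // Characters -> bytes
                try (Writer writer = new OutputStreamWriter(Files.newOutputStream(file), cs)) {
                    writer.write(text);
                }
                // Bytes -> characters
                StringBuilder result = new StringBuilder();
                try (Reader reader = new InputStreamReader(Files.newInputStream(file), cs)) {
                    char[] buf = new char[64];
                    for (int n; (n = reader.read(buf)) != -1; ) {
                        result.append(buf, 0, n);
                    }
                }
                return result.toString();
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u6211\u662f cm"; // 我是 cm
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

Passing a Charset object instead of a charset name avoids the checked UnsupportedEncodingException and rules out a misspelled name at compile time.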
Memory
First, let's take a look at this simple piece of code:
String s = "我是 cm";
byte[] bytes = s.getBytes();
String s1 = new String(bytes, "GBK");
String s2 = new String(bytes);
This code performs three conversions (one encoding and two decodings). First look at String.getBytes():
public byte[] getBytes() {
    return StringCoding.encode(value, 0, value.length);
}
Internally it calls the StringCoding.encode() method:
static byte[] encode(char[] ca, int off, int len) {
    String csn = Charset.defaultCharset().name();
    try {
        // use charset name encode() variant which provides caching.
        return encode(csn, ca, off, len);
    } catch (UnsupportedEncodingException x) {
        warnUnsupportedCharset(csn);
    }
    try {
        return encode("ISO-8859-1", ca, off, len);
    } catch (UnsupportedEncodingException x) {
        // If this code is hit during VM initialization, MessageUtils is
        // the only way we will be able to get any kind of error message.
        MessageUtils.err("ISO-8859-1 charset not available: " + x.toString());
        // If we can not find ISO-8859-1 (a required encoding) then things
        // are seriously wrong with the installation.
        System.exit(1);
        return null;
    }
}
The encode(char[] ca, int off, int len) method first encodes with the system default charset; if that charset is not supported, it falls back to ISO-8859-1. One level deeper, the name-based variant likewise substitutes ISO-8859-1 when no charset name is given:
String csn = (charsetName == null) ? "ISO-8859-1" : charsetName;
In the same way, the new String constructor calls the StringCoding.decode() method internally:
public String(byte bytes[], int offset, int length, Charset charset) {
    if (charset == null)
        throw new NullPointerException("charset");
    checkBounds(bytes, offset, length);
    this.value = StringCoding.decode(charset, bytes, offset, length);
}
The decode method handles the encoding format in the same way as encode.
In both of the cases above, setting one uniform encoding is generally enough to avoid garbled text.
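Setting a uniform encoding in memory means passing the charset explicitly to both getBytes and the String constructor instead of relying on the platform default. The sketch below uses a hypothetical helper, recode, in a hypothetical class MemoryCodecDemo; it encodes with one charset, decodes with another, and shows what happens when they match, when they differ, and when the charset cannot represent the characters at all.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MemoryCodecDemo {
    // Encodes s with one charset and decodes the resulting bytes with another.
    static String recode(String s, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = s.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String text = "\u6211\u662f cm"; // 我是 cm

        // The platform default is what getBytes() and new String(bytes) fall back to.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Same charset on both sides: the text survives.
        System.out.println(recode(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(text)); // true

        // Different charsets: the text is garbled.
        System.out.println(recode(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).equals(text)); // false

        // ISO-8859-1 cannot represent the two Chinese characters, so
        // getBytes replaces each of them with '?'.
        System.out.println(recode(text, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1)); // ?? cm
    }
}
```

The last case is worth noting because it is lossy even though both sides agree: a consistent charset only helps if that charset can represent every character in the text.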
Encoding & Encoding Formats
First, look at the Java encoding class diagram [1].
First, a Charset is obtained for the specified charset name; a CharsetEncoder object is then created from that Charset, and CharsetEncoder.encode is called to encode the string. Each encoding type corresponds to its own class, and the actual encoding is done in those classes. The following sequence diagram shows the detailed encoding process:
The class diagram and sequence diagram show the encoding process in detail. The following simple program encodes a string with ISO-8859-1, GBK, and UTF-8:
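The Charset, CharsetEncoder, encode sequence can be driven directly through the public java.nio.charset API. This is a sketch of that sequence rather than a copy of what String.getBytes does internally: EncoderDemo and its encode helper are hypothetical names, and the REPLACE error actions are my choice so that unmappable characters become the charset's replacement byte instead of raising an exception.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncoderDemo {
    static byte[] encode(String text, String charsetName) {
        // Step 1: look up the Charset by name.
        Charset charset = Charset.forName(charsetName);
        // Step 2: create an encoder from the Charset.
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            // Step 3: encode the characters into bytes.
            ByteBuffer out = encoder.encode(CharBuffer.wrap(text));
            byte[] bytes = new byte[out.remaining()];
            out.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE actions; rethrown unchecked for brevity.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u6211\u662f cm"; // 我是 cm
        byte[] viaEncoder = encode(text, "UTF-8");
        byte[] viaString = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(viaEncoder, viaString)); // true
        System.out.println(encode(text, "ISO-8859-1").length);    // 5
    }
}
```

Without the two REPLACE settings, a newly created encoder reports errors, so encoding the Chinese characters with ISO-8859-1 would throw an UnmappableCharacterException instead of producing '?' bytes.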
import java.io.UnsupportedEncodingException;

public class Test02 {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String string = "我是 cm";
        Test02.printChart(string.toCharArray());
        Test02.printChart(string.getBytes("ISO-8859-1"));
        Test02.printChart(string.getBytes("GBK"));
        Test02.printChart(string.getBytes("UTF-8"));
    }

    /**
     * Prints each char as hexadecimal.
     */
    public static void printChart(char[] chars) {
        for (int i = 0; i < chars.length; i++) {
            System.out.print(Integer.toHexString(chars[i]) + " ");
        }
        System.out.println("");
    }

    /**
     * Prints each byte as two-digit hexadecimal.
     */
    public static void printChart(byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            String hex = Integer.toHexString(bytes[i] & 0xFF);
            if (hex.length() == 1) {
                hex = '0' + hex;
            }
            System.out.print(hex.toUpperCase() + " ");
        }
        System.out.println("");
    }
}
-------------------------output:
6211 662f 20 63 6d
3F 3F 20 63 6D
CE D2 CA C7 20 63 6D
E6 88 91 E6 98 AF 20 63 6D
The program shows that "我是 cm" is encoded as follows:
char[]: 6211 662f 20 63 6d
ISO-8859-1: 3F 3F 20 63 6D
GBK: CE D2 CA C7 20 63 6D
UTF-8: E6 88 91 E6 98 AF 20 63 6D
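The UTF-8 bytes of the two Chinese characters can be checked by hand from their code points, U+6211 and U+662F. The sketch below applies the standard three-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx) with plain bit operations; Utf8ByHand, encodeThreeByte, and hex are names I introduced for illustration, and the helper only handles code points from U+0800 to U+FFFF (surrogates are not treated specially).

```java
class Utf8ByHand {
    // Splits a 16-bit code point into 4 + 6 + 6 bits and adds the UTF-8 marker bits.
    static int[] encodeThreeByte(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not in the three-byte range: " + codePoint);
        }
        return new int[] {
            0xE0 | (codePoint >> 12),         // 1110xxxx: top 4 bits
            0x80 | ((codePoint >> 6) & 0x3F), // 10xxxxxx: middle 6 bits
            0x80 | (codePoint & 0x3F)         // 10xxxxxx: low 6 bits
        };
    }

    // Formats the byte values as space-separated two-digit upper-case hex.
    static String hex(int[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encodeThreeByte(0x6211))); // E6 88 91
        System.out.println(hex(encodeThreeByte(0x662F))); // E6 98 AF
    }
}
```

The space, c, and m are ASCII, so they keep their single-byte values 20, 63, and 6D in all three encodings.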
The figure is as follows:
More & References
In both scenarios, setting one consistent, correct encoding is generally enough to avoid garbled text. I hope the explanation above gives a clearer understanding of how Java encodes and decodes. In practice, garbled text in Java arises mainly in Java Web applications, so my next post will explain the garbled-text cases there.
1. Analysis and solutions for Chinese character problems in Java programming: http://www.ibm.com/developerworks/cn/java/java_chinese/.
-----Original source: http://cmsblogs.com/?p=1491. Please respect the author's work and cite the source when reposting.
-----Personal site: http://cmsblogs.com
Solving garbled Chinese text in Java (5)-----How Java encodes and decodes