I. The Java Encoding Conversion Process
A Java class is usually what interacts with the user most directly (input and output), and this interactive content may contain Chinese text. Whether these Java classes interact with a database or with front-end pages, their lifecycle is always the same:
(1) A programmer writes program code in an editor and saves it to the operating system in .java format; we call these files source files.
(2) These source files are compiled into .class files by javac.exe in the JDK.
(3) These classes are run directly, or deployed in a web container, to produce output.
These steps describe the process only from a macroscopic perspective, which is certainly not enough to understand it; we need to look at how Java really encodes and decodes at each step:
Step one: when we write Java source files in an editor, the program files are saved in the operating system's default encoding (a Chinese operating system generally uses GBK), forming a .java file. In other words, Java source files are saved in the operating system's default file.encoding. The following code prints the system's file.encoding value.
System.out.println(System.getProperty("file.encoding"));
Step two: when we compile the .java file with javac.exe, the JDK first checks the -encoding compilation parameter to determine the source character set. If we do not specify this parameter, the JDK falls back to the operating system's default file.encoding, and then converts our Java source program from that encoding into Java's internal Unicode format in memory.
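For example (a hedged illustration, assuming a source file named Test.java saved as UTF-8), the source character set can be made explicit so that javac does not fall back to file.encoding:

javac -encoding UTF-8 Test.java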
Step three: the JDK compiles the above and writes the information held in memory into a .class file. At this point the .class file uses Unicode encoding; that is, the contents of our .class files, whether Chinese or English characters, have already been converted to Unicode.
JSP source files are handled slightly differently in this step: the web container invokes the JSP compiler, which first checks whether the JSP file declares its own file encoding. If it is not set, the JSP compiler calls the JDK to convert the JSP file into a temporary servlet class using the default encoding, and then compiles it into a .class file that is kept in a temporary folder.
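For instance (a minimal sketch, not from the original article), a JSP page usually declares its own source encoding and response encoding with a page directive, so the JSP compiler does not have to fall back to the default:

<%@ page pageEncoding="UTF-8" contentType="text/html; charset=UTF-8" %>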
Step four: run the compiled class. There are a few situations here:
(1) Running directly on the console.
(2) JSP/servlet classes.
(3) Java classes interacting with a database.
Each of these three situations handles encoding somewhat differently.
1. Classes running on the console
In this case, the JVM first reads the class file stored on the operating system into memory, where it is encoded in Unicode, and runs it. If user input is required, the information entered by the user arrives in the file.encoding encoding and is converted to Unicode for storage in memory. After the program runs, the resulting output is converted back to the file.encoding format, returned to the operating system, and printed to the console. The whole process is as follows:
In this whole process, none of the encoding conversions involved may go wrong, otherwise garbled text will be produced.
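A minimal sketch of this console case (the class name ConsoleEncodingDemo and the explicit charsets are ours, added only to make the implicit conversions visible):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.Charset;

public class ConsoleEncodingDemo {
    public static void main(String[] args) throws Exception {
        // The platform default charset, i.e. file.encoding (e.g. GBK on a Chinese OS).
        Charset systemCharset = Charset.defaultCharset();
        // Decode the user's input bytes (file.encoding) into Unicode chars in memory.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in, systemCharset));
        String line = in.readLine();
        // Encode the Unicode chars back to file.encoding bytes for the console.
        PrintStream out = new PrintStream(System.out, true, systemCharset.name());
        out.println("echo: " + line);
    }
}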
2. Servlet classes
Because a JSP file is ultimately converted into a servlet (it is just stored in a different location), JSP files are also covered here.
When a user requests a servlet, the web container invokes its JVM to run the servlet. First, the JVM loads the servlet's class into memory; the servlet code in memory is in Unicode format. The JVM then runs the servlet. If data passed from the client (such as form fields or URL parameters) needs to be accepted during the run, the web container receives the incoming data; if the program sets an encoding for the incoming parameters, that encoding is used, otherwise the default ISO-8859-1 encoding is used. After the data is received, the JVM converts it to Unicode and keeps it in memory. When the servlet runs, it produces output, which is encoded in Unicode. The web container then sends the resulting Unicode string to the client; if the program specifies an encoding for the output, it is sent to the browser in that encoding, otherwise the default ISO-8859-1 encoding is used. The process flow chart is as follows:
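A minimal sketch of setting these encodings explicitly in a servlet (the class name EncodingServlet and the parameter name are illustrative; the servlet API calls themselves are standard):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class EncodingServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // Tell the container how to decode the request body; otherwise ISO-8859-1 is assumed.
        request.setCharacterEncoding("UTF-8");
        String name = request.getParameter("name");
        // Tell the container how to encode the response; otherwise ISO-8859-1 is used.
        response.setContentType("text/html; charset=UTF-8");
        response.getWriter().println("hello, " + name);
    }
}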
3. The database section
We know that Java programs connect to the database through a JDBC driver, and the JDBC driver defaults to the ISO-8859-1 encoding. This means that when we pass data to the database through a Java program, JDBC first converts the data from Unicode to ISO-8859-1 and then sends it to the database, where it is stored; so by default the data sits in the database in ISO-8859-1 format.
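A minimal sketch of declaring the connection encoding explicitly (this assumes MySQL and its Connector/J driver; the useUnicode/characterEncoding URL parameters, the database test, and the table t_user are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class JdbcEncodingDemo {
    public static void main(String[] args) throws Exception {
        // Ask the driver to talk to the server in UTF-8 instead of its default.
        String url = "jdbc:mysql://localhost:3306/test"
                + "?useUnicode=true&characterEncoding=UTF-8";
        try (Connection conn = DriverManager.getConnection(url, "root", "password");
             PreparedStatement ps = conn.prepareStatement("INSERT INTO t_user (name) VALUES (?)")) {
            ps.setString(1, "我是cm"); // sent to the database using the declared encoding
            ps.executeUpdate();
        }
    }
}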
II. Encoding & Decoding
The following sums up the situations in which Java needs to perform encoding and decoding operations, and walks through the intermediate steps in detail, to further grasp the Java encoding and decoding process. There are four main scenarios in Java that require encoding and decoding operations:
(1) I/O operations
(2) Memory
(3) Databases
(4) Java web
The following mainly introduces the first two scenarios. The database scenario causes no problems as long as the encoding format is set correctly, while the Java web scenario involves too much to cover here (URL, GET and POST encoding, servlet decoding), so the author will introduce the Java web scenario separately.
1. I/O operations
As mentioned earlier, garbled text is nothing more than the result of inconsistent encoding formats during conversion, for example encoding with UTF-8 but decoding with GBK. The most fundamental cause is the conversion from characters to bytes or from bytes to characters, and the most important scenario in which this conversion happens is during I/O operations. Of course, I/O here covers both network I/O (i.e. Java web) and disk I/O; network I/O is described in the next section.
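A minimal sketch of that mismatch (the class name MojibakeDemo is ours): encode with UTF-8, decode with GBK, and the Chinese characters come back garbled.

import java.io.UnsupportedEncodingException;

public class MojibakeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "我是cm";
        byte[] utf8Bytes = s.getBytes("UTF-8");       // encode: chars -> UTF-8 bytes
        String wrong = new String(utf8Bytes, "GBK");  // decode with the wrong charset
        String right = new String(utf8Bytes, "UTF-8");
        System.out.println(wrong);  // garbled
        System.out.println(right);  // 我是cm
    }
}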
First, let's look at encoding during I/O operations.
InputStream is the superclass of all byte input stream classes, and Reader is the abstract class for reading character streams. Java reads files either by byte stream or by character stream, and InputStream and Reader are the respective superclasses of these two approaches.
Reading by byte
We generally use InputStream.read() to read bytes from the stream (read() reads only one byte at a time and is very slow, so we usually use read(byte[]) instead), save them into a byte[] array, and finally convert them into a String. When we read a file, the encoding of the bytes we get depends on the encoding the file was saved with, and converting to a String involves decoding, which goes wrong if the two encodings differ. For example, suppose Test.txt is encoded in UTF-8; the byte stream obtained when reading the file is therefore UTF-8 data. If we do not specify an encoding when constructing the String, the system default encoding (GBK) is used for decoding, and because the two encodings are inconsistent, the String construction is bound to produce garbled characters, as follows:
File file = new File("C:\\Test.txt");
InputStream input = new FileInputStream(file);
StringBuffer buffer = new StringBuffer();
byte[] bytes = new byte[1024];
for (int n; (n = input.read(bytes)) != -1; ) {
    buffer.append(new String(bytes, 0, n));
}
System.out.println(buffer);
The output is garbled.
The content of Test.txt is: 我是cm (Chinese for "I am cm").
To avoid garbled characters, specify the encoding format when constructing the String, so that encoding and decoding use the same character set:
buffer.append(new String(bytes, 0, n, "UTF-8"));
Reading by character
In fact, a character stream can be viewed as a wrapper stream: at the bottom it still uses a byte stream to read bytes, and it then decodes the bytes it reads into characters using the specified encoding. In Java, Reader is the superclass for reading character streams. So at the bottom there is no difference between reading a file by byte and reading it by character; when reading by character, every character still comes from bytes read one at a time by the underlying byte stream.
Bytes & Character Conversions
Converting bytes to characters must go through InputStreamReader. The API describes it as follows: InputStreamReader is a bridge from byte streams to character streams; it reads bytes and decodes them into characters using the specified charset. The charset it uses may be specified by name, given explicitly, or the platform default charset may be accepted. Each invocation of one of InputStreamReader's read() methods may cause one or more bytes to be read from the underlying byte input stream. To enable efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation. The API description is very clear: InputStreamReader still reads bytes from the underlying file, but it decodes those bytes into characters according to the specified encoding, and the system default encoding is used if none is specified.
String file = "C:\\Test.txt";
String charset = "UTF-8";
// Write: convert characters to a byte stream
FileOutputStream outputStream = new FileOutputStream(file);
OutputStreamWriter writer = new OutputStreamWriter(outputStream, charset);
try {
    writer.write("我是cm");
} finally {
    writer.close();
}
// Read: convert bytes back to characters
FileInputStream inputStream = new FileInputStream(file);
InputStreamReader reader = new InputStreamReader(inputStream, charset);
StringBuffer buffer = new StringBuffer();
char[] buf = new char[64];
int count = 0;
try {
    while ((count = reader.read(buf)) != -1) {
        buffer.append(buf, 0, count);
    }
} finally {
    reader.close();
}
System.out.println(buffer);
2. Memory
First, let's look at the following simple code:
String s = "我是cm";
byte[] bytes = s.getBytes();
String s1 = new String(bytes, "GBK");
String s2 = new String(bytes);
In this code we see three encoding conversions (one encoding and two decodings). First look at String.getBytes():
public byte[] getBytes() {
    return StringCoding.encode(value, 0, value.length);
}
Internally it calls the StringCoding.encode() method:
static byte[] encode(char[] ca, int off, int len) {
    String csn = Charset.defaultCharset().name();
    try {
        // use charset name encode() variant which provides caching.
        return encode(csn, ca, off, len);
    } catch (UnsupportedEncodingException x) {
        warnUnsupportedCharset(csn);
    }
    try {
        return encode("ISO-8859-1", ca, off, len);
    } catch (UnsupportedEncodingException x) {
        // If this code is hit during VM initialization, MessageUtils is
        // the only way we will be able to get any kind of error message.
        MessageUtils.err("ISO-8859-1 charset not available: " + x.toString());
        // If we can not find ISO-8859-1 (a required encoding) then things
        // are seriously wrong with the installation.
        System.exit(1);
        return null;
    }
}
The encode(char[] ca, int off, int len) method first uses the system's default charset; inside the named encode() variant, if no charset name is supplied, ISO-8859-1 is used by default, as can be seen by going one level deeper:
String csn = (charsetName == null) ? "ISO-8859-1" : charsetName;
In the same way, we can see inside the String constructor a call to the StringCoding.decode() method:
public String(byte bytes[], int offset, int length, Charset charset) {
    if (charset == null)
        throw new NullPointerException("charset");
    checkBounds(bytes, offset, length);
    this.value = StringCoding.decode(charset, bytes, offset, length);
}
The decode() method handles the charset in the same way as encode().
For the above two situations, as long as we use a single, consistent encoding format, garbled text generally will not be produced.
3. Encoding & encoding formats
First, look at the Java encoding class diagram:
First the Charset class is obtained according to the specified charset, then a CharsetEncoder object is created from that Charset, and CharsetEncoder.encode() then encodes the string. Each encoding type corresponds to its own class, and the actual encoding work is done inside these classes. The sequence diagram below shows the detailed encoding process:
Through this class diagram and sequence diagram we can understand the detailed encoding process. The following simple code encodes a string in ISO-8859-1, GBK, and UTF-8:
import java.io.UnsupportedEncodingException;

public class Test02 {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String string = "我是cm";
        Test02.printChart(string.toCharArray());
        Test02.printChart(string.getBytes("ISO-8859-1"));
        Test02.printChart(string.getBytes("GBK"));
        Test02.printChart(string.getBytes("UTF-8"));
    }

    /**
     * Print chars as hex
     */
    public static void printChart(char[] chars) {
        for (int i = 0; i < chars.length; i++) {
            System.out.print(Integer.toHexString(chars[i]) + " ");
        }
        System.out.println("");
    }

    /**
     * Print bytes as hex
     */
    public static void printChart(byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            String hex = Integer.toHexString(bytes[i] & 0xFF);
            if (hex.length() == 1) {
                hex = '0' + hex;
            }
            System.out.print(hex.toUpperCase() + " ");
        }
        System.out.println("");
    }
}
Output:
6211 662f 63 6d
3F 3F 63 6D
CE D2 CA C7 63 6D
E6 88 91 E6 98 AF 63 6D
From the program's output we can see how "我是cm" is represented:
char[]: 6211 662f 63 6d
ISO-8859-1: 3F 3F 63 6D
GBK: CE D2 CA C7 63 6D
UTF-8: E6 88 91 E6 98 AF 63 6D
ISO-8859-1 cannot represent the Chinese characters, so each one is replaced by a single 0x3F ('?'), while GBK uses two bytes and UTF-8 uses three bytes per Chinese character; the ASCII letters 'c' and 'm' are the same single byte in every encoding. The figure is as follows:
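To connect the class diagram described above with concrete code, here is a minimal sketch of the Charset → CharsetEncoder → encode() sequence (the class name CharsetEncoderDemo is ours):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class CharsetEncoderDemo {
    public static void main(String[] args) throws Exception {
        Charset charset = Charset.forName("UTF-8");    // look up the Charset by name
        CharsetEncoder encoder = charset.newEncoder(); // create the encoder from the Charset
        // Encode the characters into bytes and print them as hex.
        ByteBuffer bytes = encoder.encode(CharBuffer.wrap("我是cm"));
        while (bytes.hasRemaining()) {
            System.out.printf("%02X ", bytes.get() & 0xFF);
        }
        System.out.println();
    }
}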