I often run into Java encoding problems at work, and because I had never looked into them properly I could not give definite answers. This weekend I read up on the subject and summarized what I found here.
Question 1: What encoding should I use to read files in Java?
Java can read files in two ways: as bytes or as characters. Byte-oriented reading uses InputStream.read() to read bytes into a byte[] array, and at the end new String(byte[]) is often used to convert the byte array to a String. That last step hides an encoding detail: new String(byte[]) decodes the byte array with the platform's default character set, which on a Chinese operating system is GBK. The bytes we read from the input stream are not necessarily GBK-encoded, because their encoding depends on the encoding of the file being read.
For example, create a new file demo.txt on the d: drive, write a Chinese word in it (rendered here in translation as "we"), and save it. In this case demo.txt is encoded as ANSI, which on a Chinese operating system means GBK. The bytes obtained by reading the file through a byte input stream are therefore GBK-encoded, and decoding them with new String(byte[]), which uses the platform default GBK, works fine because the byte encoding matches the default decoding.
Now try again, but choose UTF-8 when saving demo.txt, so that the file is UTF-8 rather than ANSI. Reading it with the same byte input stream now yields different bytes, namely UTF-8 encoded ones. An obvious difference is that GBK uses two bytes per Chinese character, while UTF-8 uses three bytes for most common Chinese characters. If we still construct the String with new String(byte[]), garbled characters may appear. The reason is simple: the constructor decodes as GBK by default, while our bytes are UTF-8. The correct approach is new String(byte[], "UTF-8"). Then the encoding of the bytes and the decoding used by the constructor agree, and nothing is garbled.
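The mismatch described above can be reproduced without any file. The following is only a minimal sketch: the class name DecodeMismatchDemo and the sample text (two Chinese characters written as Unicode escapes) are assumptions made for illustration, and it assumes the running JDK ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeMismatchDemo {
    // Assumption: GBK is available in this JDK (it is in standard full JDK builds).
    static final Charset GBK = Charset.forName("GBK");

    /** Decodes bytes with an explicitly named charset instead of the platform default. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Hypothetical sample text: two Chinese characters, written as escapes
        // so that the encoding of this source file does not matter.
        String text = "\u6211\u4eec";

        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        byte[] gbkBytes = text.getBytes(GBK);

        // The same characters produce different byte sequences per encoding.
        System.out.println("UTF-8 byte count: " + utf8Bytes.length);
        System.out.println("GBK byte count:   " + gbkBytes.length);

        // Decoding with the matching charset restores the text; a mismatch does not.
        System.out.println("UTF-8 bytes read as UTF-8 match: "
                + decode(utf8Bytes, StandardCharsets.UTF_8).equals(text));
        System.out.println("UTF-8 bytes read as GBK match:   "
                + decode(utf8Bytes, GBK).equals(text));
    }
}
```

Naming the charset explicitly, as decode does, removes the dependency on whatever default the machine happens to have.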
Next, the byte output stream.
When we write bytes to a file with a byte output stream, we cannot specify the encoding of the generated file (assuming the file did not exist before). So what encoding does the generated file have? Tests show that it depends on the encoding of the bytes written. For example, take the following code:
OutputStream out = new FileOutputStream("d:\\demo.txt");
out.write("we".getBytes());
getBytes() with no argument uses the platform default, which here is GBK. So we write GBK-encoded bytes to demo.txt, and the file is GBK-encoded. If you modify the program slightly to out.write("we".getBytes("UTF-8")); then the bytes written are UTF-8 encoded, and the generated file is UTF-8 as well. (In both snippets "we" stands for a Chinese word.) One more point: if we replace the Chinese text with ASCII characters such as 123 or abc, the generated file is reported as GBK whether we use getBytes() or getBytes("UTF-8"), because GBK and UTF-8 encode ASCII characters as the same single bytes.
To summarize: the encoding of the bytes read through an InputStream depends on the encoding of the file, while the encoding of a file generated through an OutputStream depends on the encoding of the bytes written.
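The ASCII observation can be checked by comparing the byte arrays directly. This is a hedged sketch rather than the original test: EncodeBytesDemo and sameBytes are hypothetical names, the sample strings are assumptions, and GBK is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class EncodeBytesDemo {
    // Assumption: GBK is available in this JDK.
    static final Charset GBK = Charset.forName("GBK");

    /** Returns true if the text encodes to the same byte sequence in both charsets. */
    static boolean sameBytes(String text, Charset first, Charset second) {
        return Arrays.equals(text.getBytes(first), text.getBytes(second));
    }

    public static void main(String[] args) {
        // ASCII characters are single bytes with the same values in GBK and UTF-8.
        System.out.println("ASCII text, same bytes:   "
                + sameBytes("abc123", GBK, StandardCharsets.UTF_8));

        // Chinese characters (written as escapes) encode differently.
        System.out.println("Chinese text, same bytes: "
                + sameBytes("\u6211\u4eec", GBK, StandardCharsets.UTF_8));
    }
}
```

Because the bytes for pure ASCII text are identical, nothing in the file itself distinguishes the two encodings, which is why a tool inspecting it falls back to its default label.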
The following describes how to use a character input stream to read files.
First, we need to understand character streams. A character stream can be seen as a wrapper stream: underneath, it still uses a byte stream to read bytes, and then it decodes those bytes into characters using the specified encoding. When talking about character streams we have to mention InputStreamReader. The Java API describes it as follows: an InputStreamReader is a bridge from byte streams to character streams; it reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted. This is quite clear: InputStreamReader reads bytes from the underlying byte stream and then needs an encoding to decode them. If we construct an InputStreamReader without passing an encoding, the platform default, GBK here, is used to decode the bytes. Suppose demo.txt is again GBK-encoded, and we read it with the following code:
InputStreamReader in = new InputStreamReader(new FileInputStream("d:\\demo.txt"));
Reading this way produces no garbled characters: the file is GBK-encoded, so the bytes read are GBK-encoded, and InputStreamReader decodes with GBK by default. If we change the encoding of demo.txt to UTF-8 and read it the same way, the text is garbled, because the byte encoding (UTF-8) does not match the decoding (GBK). The solution is as follows:
InputStreamReader in = new InputStreamReader(new FileInputStream("d:\\demo.txt"), "UTF-8");
Specifying the decoding charset for InputStreamReader makes the two agree, so nothing is garbled.
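To try the reader behaviour end to end, one possible sketch is shown here. It writes a temporary file instead of d:\demo.txt, and ReaderCharsetDemo, readFirstLine, and writeUtf8TempFile are hypothetical names introduced for this example; GBK is again assumed to be available.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ReaderCharsetDemo {
    /** Reads the first line of a file, decoding its bytes with the given charset. */
    static String readFirstLine(Path file, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the text to a temporary file as UTF-8 bytes and returns its path. */
    static Path writeUtf8TempFile(String text) {
        try {
            Path file = Files.createTempFile("demo", ".txt");
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            file.toFile().deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u6211\u4eec"; // hypothetical sample: two Chinese characters
        Path file = writeUtf8TempFile(text);

        // Matching charset: the text comes back unchanged.
        System.out.println("Read as UTF-8 matches: "
                + text.equals(readFirstLine(file, StandardCharsets.UTF_8)));

        // Mismatched charset: the same bytes decode to different characters.
        System.out.println("Read as GBK matches:   "
                + text.equals(readFirstLine(file, Charset.forName("GBK"))));
    }
}
```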
The following describes the character output stream.
The character output stream works on the same principle as the character input stream. It can also be seen as a wrapper stream: underneath, it still uses a byte output stream to write the file, and the character output stream only converts characters to bytes using the specified encoding. The main class is OutputStreamWriter, which the Java API describes as follows: an OutputStreamWriter is a bridge from character streams to byte streams; characters written to it are encoded into bytes using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted. Clearly, it needs an encoding to convert the written characters into bytes. If none is specified, it uses the platform default, GBK here, so the bytes written are GBK-encoded and the generated file is GBK-encoded too. If the OutputStreamWriter is constructed like this:
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream("d:\\demo.txt"), "UTF-8");
Then the characters written are encoded as UTF-8 bytes, and the generated file is UTF-8 as well.
Question 2: Since a file must be read with the same encoding it was saved in, and javac also has to read source files when compiling, what encoding does javac use?
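The effect of the writer's charset on the bytes it produces can be observed with an in-memory stream. In this sketch a ByteArrayOutputStream stands in for the FileOutputStream so that no file is needed; WriterCharsetDemo and encodeThroughWriter are hypothetical names, and GBK is assumed to be available.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WriterCharsetDemo {
    /** Writes the text through an OutputStreamWriter and returns the bytes that reached the byte stream. */
    static byte[] encodeThroughWriter(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Closing the writer flushes any buffered characters into the sink.
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u6211\u4eec"; // hypothetical sample: two Chinese characters

        System.out.println("Bytes written as UTF-8: "
                + encodeThroughWriter(text, StandardCharsets.UTF_8).length);
        System.out.println("Bytes written as GBK:   "
                + encodeThroughWriter(text, Charset.forName("GBK")).length);
    }
}
```

The bytes that arrive in the underlying stream are the same ones String.getBytes would produce for that charset, which is why the writer's charset determines the encoding of the resulting file.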
I had never thought about this or considered it a problem; it only came up because of Question 1. There is in fact something worth digging into here. The following three cases, which are also the common ways of compiling Java source files, are discussed in turn.
1. Compiling Java source files with javac on the console.
Usually we create a Java file Demo.java by hand and save it. In this case Demo.java is encoded as ANSI, which on a Chinese operating system is GBK. We then compile the source file with the command "javac Demo.java". javac also has to read the Java file, so which encoding does it use to decode the bytes it reads? It uses the platform default, GBK here, which is exactly the encoding of Demo.java. The two agree, so nothing is garbled. Now change one thing: when saving Demo.java, choose UTF-8, so that the file is UTF-8 encoded, and compile again with "javac Demo.java". If Demo.java contains Chinese characters, the console shows warnings and garbled text. The reason is that javac decodes the bytes as GBK while they are actually UTF-8. If you don't believe it, try it yourself. What is the solution? Use javac's -encoding option to specify the decoding charset: javac -encoding UTF-8 Demo.java. Here we tell javac to decode the bytes as UTF-8, which matches the encoding of Demo.java, so nothing is garbled.
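The same -encoding option can also be passed programmatically. The sketch below swaps the console command for the standard javax.tools compiler API, which is not what the text above uses; it needs a full JDK rather than a bare runtime. JavacEncodingDemo, writeUtf8Source, and compile are hypothetical names, and the generated Demo.java content is an assumption made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

final class JavacEncodingDemo {
    /** Writes a tiny Demo.java as UTF-8 bytes into a fresh temporary directory. */
    static Path writeUtf8Source() {
        // Hypothetical source content: a class holding one Chinese string literal.
        String source = "public class Demo { String s = \"\u6211\u4eec\"; }\n";
        try {
            Path dir = Files.createTempDirectory("javac-encoding-demo");
            Path file = dir.resolve("Demo.java");
            Files.write(file, source.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compiles the file with an explicit -encoding; returns the compiler's exit code (0 means success). */
    static int compile(Path sourceFile, String encoding) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler found; run this on a JDK");
        }
        // Passing null for the streams uses System.in, System.out and System.err.
        return compiler.run(null, null, null,
                "-encoding", encoding,
                "-d", sourceFile.getParent().toString(),
                sourceFile.toString());
    }

    public static void main(String[] args) {
        Path source = writeUtf8Source();
        System.out.println("Exit code with -encoding UTF-8: " + compile(source, "UTF-8"));
    }
}
```

Note that a wrong -encoding does not always make compilation fail: the bytes may still decode to valid but different characters, so the damage shows up in the string contents rather than in the exit code.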
2. Compiling Java files in Eclipse.
I am used to setting Eclipse's encoding to UTF-8, so the Java source files in every project are UTF-8 encoded. Compiled this way, there has never been a problem and never any garbled text, and that is exactly why the garbling that can occur with plain javac stayed hidden from me. So how does Eclipse correctly compile Java source files that are UTF-8 encoded? The explanation is that Eclipse knows the encoding configured for our Java source files and compiles them with the matching encoding setting. Credit goes to the IDE.
3. Compiling Java files with Ant.
Ant is another common tool for compiling Java files. First, you need to know that Ant uses javac behind the scenes to compile Java source files, so the problems that occur with javac also occur with Ant. If we use Ant to compile UTF-8 encoded Java source files without specifying the encoding, garbled characters appear as well. For this reason Ant's <javac> task has an encoding attribute that lets us specify it. To compile Java source files encoded as UTF-8, the task should look like this:
<javac destdir="${classes}" target="1.4" source="1.4" deprecation="off" debug="on" debuglevel="lines,vars,source" optimize="off" encoding="UTF-8">
Specifying the encoding here is equivalent to "javac -encoding", so there will be no garbled characters.
Question 3: JSP compilation in Tomcat.
This question also follows from Question 2. Since javac needs the correct encoding to compile Java source files, and Tomcat also has to read the file when it compiles a JSP, what encoding does Tomcat use to read it? Can garbled characters occur? The analysis follows.
We usually write the following at the top of a JSP:
<%@ page language="java" contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %>
I often left out pageEncoding without understanding what it does, yet nothing was ever garbled. In fact, this attribute tells Tomcat which encoding to use when reading the JSP file, so it should match the encoding the JSP file is saved in. For example, if we create a new JSP file and set its file encoding to GBK, then pageEncoding should be GBK. The characters in the file are then stored as GBK bytes, Tomcat reads the file as GBK, and the bytes are decoded correctly with no garbling. If pageEncoding were set to UTF-8 instead, the text would be garbled when the JSP file is read and decoded. As mentioned above, I often do not write the pageEncoding attribute, but there is no garbled code. This is