Coding problems in Java

Source: Internet
Author: User

Encoding problems come up often in Java work, and because I had never researched them properly I could never give an exact answer. This weekend I looked up some material online, and here is a summary.

The first things we need to know are:

(1) Disks store bytes, and networks transmit bytes;
(2) Encoding means converting characters into bytes of a specific encoding format;
(3) Decoding means converting bytes of a specific encoding format back into characters.

Question one: What encoding should I use to read a file in Java?

Java reads files in two ways: by byte and by character. Reading by byte means calling InputStream.read() to read bytes into a byte[] array, and then typically calling new String(byte[]) to convert the byte array into a string. That last step hides an encoding detail: new String(byte[]) decodes the byte array with the operating system's default character set, which on a Chinese system is GBK. But the bytes we read from the input stream are not necessarily GBK bytes, because their encoding depends on the encoding of the file being read.

For example, create a new file demo.txt on the D: drive, write the Chinese text "我们" into it, and save. At this point demo.txt is encoded as ANSI, which on a Chinese system means GBK. The bytes we read from it with a byte input stream are therefore GBK-encoded bytes, and the final new String(byte[]) decodes them with the platform default GBK, so everything works (the bytes' encoding and the default decoding agree).

Now imagine we choose UTF-8 when saving demo.txt. The file's encoding is no longer ANSI but UTF-8. Reading it with the same byte input stream yields different bytes this time: UTF-8 bytes. The two byte sequences are clearly different; one obvious difference is that GBK uses two bytes per Chinese character while UTF-8 uses three. If we still construct the string with new String(byte[]), we get garbled text, for the simple reason that the constructor decodes with the default GBK while our bytes are UTF-8. The correct approach is to construct the string with new String(byte[], "UTF-8"), so that the decoding matches the bytes' encoding and no garbled text appears.
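This mismatch can be reproduced without touching the file system at all. Here is a minimal sketch, with ISO-8859-1 standing in for the "wrong" charset since whether GBK is available depends on the JRE:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "我们";

        // Encoding: characters -> UTF-8 bytes (three bytes per Chinese character).
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8Bytes.length);  // 6

        // Decoding with a charset that does not match the bytes produces garbled text.
        String garbled = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(original.equals(garbled));  // false

        // Decoding with the matching charset recovers the original characters.
        String ok = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(original.equals(ok));  // true
    }
}
```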

That covers the byte input stream; now for the byte output stream.

We know that when a byte output stream writes bytes to a file, we cannot specify the encoding of the generated file (assuming the file did not already exist). So what encoding does the generated file end up with? Testing shows that it depends on the encoding of the bytes that are written, as in the following code:

OutputStream out = new FileOutputStream("D:\\demo.txt");
out.write("我们".getBytes());

getBytes() encodes the characters with the operating system's default character set, GBK here, so we write GBK-encoded bytes into demo.txt, and the file's encoding is therefore GBK. Modify the program slightly, out.write("我们".getBytes("UTF-8")), and the bytes written are UTF-8, so demo.txt ends up UTF-8 encoded. One more point: if you replace "我们" with ASCII characters such as 123 or abc, the resulting file looks the same whether you use getBytes() or getBytes("UTF-8"), because ASCII characters have identical byte representations in both encodings.

To summarize: the encoding of the bytes read from an InputStream depends on the encoding of the file itself, and the encoding of the file generated by an OutputStream is determined by the encoding of the bytes written to it.
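The summary above can be sketched as a round trip; a temp file stands in for D:\demo.txt, and the charset is passed explicitly so the result does not depend on the platform default:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class ByteStreamRoundTrip {
    public static void main(String[] args) throws IOException {
        // A temp file stands in for D:\demo.txt.
        File file = File.createTempFile("demo", ".txt");
        file.deleteOnExit();

        // The bytes written determine the file's encoding: this file is now UTF-8.
        try (OutputStream out = new FileOutputStream(file)) {
            out.write("我们".getBytes(StandardCharsets.UTF_8));
        }

        // The bytes read depend on the file's encoding, so decode with the same charset.
        byte[] bytes = new byte[(int) file.length()];
        try (InputStream in = new FileInputStream(file)) {
            int n = in.read(bytes);
        }
        System.out.println(new String(bytes, StandardCharsets.UTF_8));  // 我们
    }
}
```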



Next, reading a file with a character input stream.

First we need to understand character streams. A character stream can be seen as a wrapper stream: underneath, it still uses a byte stream to read bytes, and then decodes those bytes into characters with a specified encoding. The key class here is InputStreamReader. The Java API describes it as a bridge from byte streams to character streams: it reads bytes and decodes them into characters using a specified charset; the charset may be specified by name or given explicitly, or the platform's default charset may be accepted. In other words, InputStreamReader reads bytes at the bottom and needs an encoding to decode them; if we pass no charset when constructing the InputStreamReader, it decodes the bytes it reads with the operating system default, GBK. Using the demo.txt example again, suppose demo.txt is GBK encoded and we read it with the following code:

InputStreamReader in = new InputStreamReader(new FileInputStream("demo.txt"));

Then what we read is not garbled: the file is GBK encoded, so the bytes read out are GBK bytes, and InputStreamReader decodes with GBK by default. If demo.txt is UTF-8 encoded, reading it this way produces garbled text, because the bytes' encoding (UTF-8) disagrees with our decoding (GBK). The solution is as follows:

InputStreamReader in = new InputStreamReader(new FileInputStream("demo.txt"), "UTF-8");

Specifying the decoding charset for the InputStreamReader makes the two agree, so no garbled text appears.
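Putting it together, here is a minimal sketch of reading a UTF-8 file correctly (a temp file stands in for demo.txt, and BufferedReader is wrapped around the InputStreamReader for line-oriented reading):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class ReaderDemo {
    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("demo", ".txt");
        file.deleteOnExit();
        try (OutputStream out = new FileOutputStream(file)) {
            out.write("我们".getBytes(StandardCharsets.UTF_8));  // a UTF-8 encoded file
        }

        // The reader's charset must match the file; passing it explicitly avoids
        // depending on the platform default (GBK on a Chinese system).
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(file), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());  // 我们
        }
    }
}
```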

Next, a word about the character output stream.

The character output stream works on the same principle as the character input stream: it too can be seen as a wrapper stream that uses a byte output stream underneath to write the file, except that it first converts characters to bytes according to a specified encoding. Its main class is OutputStreamWriter. The Java API describes it as a bridge from character streams to byte streams: characters written to it are encoded into bytes using a specified charset; the charset may be specified by name or given explicitly, or the platform's default charset may be accepted. Clearly it needs an encoding to convert the written characters to bytes; if none is specified it uses GBK, the output bytes are GBK encoded, and the resulting file is GBK encoded. If instead you construct the OutputStreamWriter as follows:

OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream("dd.txt"), "UTF-8");

Then the characters written will be encoded as UTF-8 bytes, and the resulting file will be UTF-8 encoded.
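A small sketch of this encoding step, with a ByteArrayOutputStream standing in for the FileOutputStream to dd.txt so the produced bytes can be inspected:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class WriterDemo {
    public static void main(String[] args) throws IOException {
        // Stands in for the FileOutputStream to dd.txt.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();

        // The writer encodes each character to UTF-8 bytes before they reach
        // the underlying byte stream.
        try (OutputStreamWriter out = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            out.write("我们");
        }
        System.out.println(sink.size());  // 6: three UTF-8 bytes per Chinese character
    }
}
```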



Question two: Since reading a file requires an encoding consistent with the file's encoding, and javac also has to read source files when compiling, what encoding does it use?

I had never thought about this question, and it had never caused me a problem; but once you start thinking about it, there is something to dig into. Below are three scenarios to explore, which are also the three common ways of compiling Java source files.

1. Compiling Java source files with javac on the command line.

Usually we create a Java file, Demo.java, by hand and save it. At this point Demo.java is encoded as ANSI, i.e. GBK on a Chinese system. We then compile it with:

javac Demo.java

javac also has to read the Java file, so what encoding does it use to decode the bytes it reads? In fact javac decodes with the operating system default, GBK, which is also the encoding of Demo.java; the two agree, so nothing is garbled.

Now try something: when saving Demo.java, choose UTF-8, so the file is UTF-8 encoded. If Demo.java contains Chinese characters, compiling it with "javac Demo.java" now produces warnings on the console and garbled output. The reason is that javac decodes the bytes it reads with GBK, while our bytes are UTF-8 encoded. (Try it yourself if you doubt it.) The solution is javac's -encoding parameter, which specifies the decoding charset:

javac -encoding UTF-8 Demo.java

Here we decode the bytes we read with UTF-8, consistent with Demo.java's encoding, so no garbled text appears.



2. Compiling Java files in Eclipse.

I am used to setting Eclipse's encoding to UTF-8, so the Java source files in every project are UTF-8. Compiling has never been a problem and never produced garbled text, which is exactly what hid the fact that javac can produce mojibake. So how does Eclipse correctly compile Java source files encoded as UTF-8? The only explanation is that Eclipse automatically detects the encoding of our source files and passes the correct encoding parameter when compiling them. Such is the power of the IDE.



3. Compiling Java files with Ant.

Ant is another tool I commonly use to compile Java files. First, know that Ant actually invokes javac behind the scenes, so it is conceivable that the problem from scenario 1 exists with Ant too: if we use Ant to compile UTF-8 encoded Java source files without specifying how to decode them, we get garbled text. Ant's compile task <javac> therefore has an "encoding" attribute that lets us specify the encoding. To compile source files encoded as UTF-8, the task should look like this:

<javac destdir="${classes}" target="1.4" source="1.4" deprecation="off" debug="on" debuglevel="lines,vars,source" optimize="off" encoding="UTF-8">

Specifying the encoding here is equivalent to "javac -encoding", so no garbled text appears.



Question three: What happens when Tomcat compiles JSPs?

This question follows from question two. javac needs the correct encoding to compile Java source files, and Tomcat likewise has to read JSP files in order to compile them. So what encoding does Tomcat use to read them? Can garbled text occur? Let's analyze it.

We usually write the following code at the beginning of the JSP:

<%@ page language="java" contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %>

I often leave out the pageEncoding attribute without understanding its role, and yet never see garbled text. In fact this attribute tells Tomcat what encoding to use when reading the JSP file, so it should match the encoding of the JSP file itself. For example, if we create a new JSP file and save it with GBK encoding, then pageEncoding should be set to GBK: the characters we write into the file become GBK bytes, Tomcat also reads the file with GBK, the bytes are decoded correctly, and nothing is garbled. If pageEncoding were instead set to UTF-8, the JSP file would be decoded incorrectly and garbled.

Experimenting in Eclipse shows that when we set pageEncoding, the encoding of the JSP file changes along with it; and when pageEncoding is not set, the file's encoding follows the charset specified in the contentType attribute.

So why do I never see garbled text even though I often omit pageEncoding? Because when pageEncoding is absent, Tomcat uses the charset in contentType to read the JSP file. My JSP files are usually saved as UTF-8 and the contentType charset is also set to UTF-8, so Tomcat decodes the JSP bytes it reads with UTF-8; the encodings agree and nothing is garbled. That is just one function of the charset in contentType; it has two more, discussed below.

One might ask: if I set neither pageEncoding nor the charset in contentType, what encoding does Tomcat use to decode the JSP file it reads? The answer is ISO-8859-1, the default value of pageEncoding, and reading a file containing Chinese with this encoding obviously produces garbled text.



Question four: Output.

The analyses in questions two and three actually describe the transcoding from source file to class file. Class files store characters as Unicode, and what we did earlier was convert various encodings to Unicode, e.g. from GBK to Unicode or from UTF-8 to Unicode, because only decoding with the correct charset guarantees no garbled text. The JVM also works in Unicode at run time, and on output the characters are converted once more. Let's look at two cases.

1. Output with System.out.println in Java.

For example, System.out.println("我们"). After correct decoding, "我们" is held in memory as Unicode, but when output to standard output (the console), the JVM transcodes once more: it converts the in-memory Unicode into the operating system's default encoding (GBK on a Chinese system) and then writes it to the console, because the terminal display of a Chinese system expects GBK. Since the terminal's encoding cannot ordinarily be changed by hand, this process is transparent to us: as long as compilation transcoded correctly, the final output is correct and not garbled. In Eclipse, however, you can set the console's character encoding in the Common tab of the Run Configuration dialog. Try setting it to UTF-8: the output becomes garbled, because the output was transcoded to GBK while the console displays it as UTF-8, and the encodings differ.
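When you do control the output side, you can make its charset explicit instead of relying on the platform default. A minimal sketch (assumes Java 10+ for the Charset overload of the PrintStream constructor; a ByteArrayOutputStream stands in for the console so the transcoding is visible):

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class ConsoleEncodingDemo {
    public static void main(String[] args) {
        // Stands in for the console.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();

        // A PrintStream constructed with an explicit charset transcodes the
        // in-memory Unicode to UTF-8 on output instead of the platform default.
        PrintStream out = new PrintStream(sink, true, StandardCharsets.UTF_8);
        out.print("我们");

        System.out.println(sink.size());  // 6 UTF-8 bytes (GBK would produce 4)
    }
}
```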



2. Output to the client browser with out.println() in a JSP.

Once a JSP has been compiled into a class, output to the client involves another transcoding step. Plain Java transcodes with the operating system's default encoding, so what does Tomcat use? In fact Tomcat transcodes according to the charset parameter of contentType in <%@ page language="java" contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %>. contentType sets the encoding Tomcat uses to send HTML content to the browser: Tomcat encodes the in-memory Unicode with it, so after transcoding the output sent to the client is UTF-8. How, then, does the browser know what encoding to use to display what it receives? That is the third function of the contentType charset: the encoding is stated in the HTTP response headers to inform the browser, and the browser displays the received content using the charset from the response's Content-Type header.

Summarize the three roles of the contentType charset:

1). When there is no pageEncoding attribute, Tomcat uses it to decode the JSP file it reads.

2). When Tomcat outputs to the client, it is used to transcode what is sent (that is, into bytes of that particular encoding).

3). It informs the browser what encoding to use to display the received content.

To better understand the decoding and transcoding process described above, let's look at an example.

Create a new index.jsp file encoded as GBK, and write the following at the top of the JSP:

<%@ page language="java" contentType="text/html; charset=UTF-8" pageEncoding="GBK" %>

Here charset and pageEncoding differ, yet no garbled text appears; let me explain. First, Tomcat reads the JSP content and, following the GBK specified by pageEncoding, decodes the GBK bytes it read and converts them into Unicode in the class file. Then, at output time (out.println()), Tomcat uses the charset attribute to convert the in-memory Unicode into UTF-8, and the response headers tell the browser to display the received content as UTF-8. No step in the process transcodes incorrectly, so nothing is garbled.



Question five: The decoding charset used by Properties and ResourceBundle.

These two commonly used classes do not let us specify a decoding charset while they read files, so what decoding do they use? A look at the source shows they decode with ISO-8859-1. That makes it easy to understand why the properties files we write must be ISO-8859-1: any other encoding would produce garbled characters. And since ISO-8859-1 cannot represent Chinese, the Chinese we enter has to be converted to Unicode escapes; usually we do this with a plugin, or with the native2ascii tool shipped with the JDK.
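A small sketch of why the escapes work: Properties.load(InputStream) always decodes the stream as ISO-8859-1, but the \uXXXX escapes in the values are expanded afterwards, recovering the Chinese (the key name "greeting" is just an illustrative choice):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class PropertiesDemo {
    public static void main(String[] args) throws IOException {
        // "greeting=我们" written the native2ascii way: pure ASCII with \u escapes.
        String content = "greeting=\\u6211\\u4eec\n";

        Properties props = new Properties();
        // load(InputStream) decodes the bytes as ISO-8859-1 ...
        props.load(new ByteArrayInputStream(content.getBytes(StandardCharsets.ISO_8859_1)));

        // ... and then expands the \uXXXX escapes, so the Chinese survives.
        System.out.println(props.getProperty("greeting"));  // 我们
    }
}
```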
