Text character encoding problems in Java

Source: Internet
Author: User

Garbled Chinese characters in Java have always been a nuisance, especially in Web applications. There are plenty of analyses and solutions on the web, but each tends to cover only one particular situation. Many times I have hit a garbled-text problem, resolved it after painful debugging and searching, and believed I had finally mastered the tricks of these garbled-character monsters. Then some time passes, the application or the environment changes, the hateful "Martian text" shows up again, and I am once more at a loss. So I decided to sort out the Chinese character encoding problem properly, both to help my own memory and to give fellow programmers a reference.

The first thing to understand is how Java handles characters. Java stores character data as Unicode, and working with characters typically takes three steps:

-Read character data from the source input stream, decoding it with the specified character encoding

-Store the character data in memory as Unicode

-Encode the character data with the specified character encoding and write it to the destination output stream

So Java always performs two encoding conversions when handling characters: one from the specified encoding to Unicode, and one from Unicode to the specified encoding. If the data is decoded with the wrong encoding at read time, the wrong Unicode characters end up in memory. Between being read from the original file and finally appearing on the screen, character data is converted several times by the applications along the way. If at any point the bytes read from an input stream are decoded with the wrong encoding, or characters are written to an output stream with the wrong encoding, the next receiver of that data will decode it incorrectly, and the final display is garbled.

This is our guiding principle for analyzing and solving character encoding problems.
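The two conversions described above can be sketched in a few lines of Java. This is only an illustration, not code from the original article: the class name EncodingRoundTrip and the helper roundTrip are made up for the example, and it assumes the JDK in use provides the GBK charset. The sample string is written with Unicode escapes so that the result does not depend on how the source file itself is saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingRoundTrip {
    /** Encodes text with one charset, then decodes the resulting bytes with another. */
    static String roundTrip(String text, Charset writeWith, Charset readWith) {
        byte[] bytes = text.getBytes(writeWith); // Unicode -> bytes (output side)
        return new String(bytes, readWith);      // bytes -> Unicode (input side)
    }

    public static void main(String[] args) {
        // "Hello, everyone" in Chinese, as Unicode escapes
        String text = "\u5927\u5bb6\u597d";
        Charset gbk = Charset.forName("GBK"); // assumed to be available in this JDK

        String matched = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mismatched = roundTrip(text, StandardCharsets.UTF_8, gbk);

        System.out.println("same encoding both ways keeps the text:  " + matched.equals(text));
        System.out.println("UTF-8 bytes decoded as GBK keep the text: " + mismatched.equals(text));
    }
}
```

When the same encoding is used in both directions the text survives; when UTF-8 bytes are decoded as GBK it does not, and that mismatch is the pattern to look for in each case analyzed in this article.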

OK, now let's deal with these garbled monsters one by one.

First, Chinese characters hard-coded in a Java file come out garbled on the console when run in Eclipse.

For example, we write the following code in a Java file:

String text = "大家好"; // "Hello, everyone"

System.out.println(text);

If we compile and run this in Eclipse, we may see garbled output such as: ????. Why is that?

Let's walk through the entire character conversion process.

1. Type the Chinese characters in the Eclipse editor and save the Java file as UTF-8. Several encoding conversions already happen here, but since we trust Eclipse to be correct we need not analyze them; we simply take it that the saved Java file really is UTF-8.

2. Compile and run this Java file in Eclipse. Here the encoding conversions at compile time and at run time need to be analyzed in detail.

-Compile: When we compile a Java file with javac, javac does not try to guess the encoding of the file, so it has to be told which encoding to read the file with. By default javac parses the Java file using the platform default character encoding, which is determined by the operating system. We use a Chinese operating system whose locale is usually mainland China, so the platform default encoding is usually GBK. This value can be checked in Java with System.getProperty("file.encoding"). So by default javac parses the Java file as GBK. To change the encoding javac uses, add the -encoding option, for example javac -encoding utf-8 Test.java.

One more thing worth mentioning: Eclipse uses its own built-in compiler, to which we cannot pass such options; if you need to pass options to javac, compiling with Ant is recommended. This is not the cause of the garbled output, however, because Eclipse lets you set the character encoding of each Java file, and the built-in compiler compiles the file according to that setting.

-Run: After compilation, character data is stored in the bytecode (class) file in a Unicode-based format (string constants are kept as modified UTF-8). Eclipse then invokes the java command to run the bytecode. Because the characters in the bytecode are always in this format, no encoding choice is involved when the virtual machine reads the class file. Once the file is loaded, the character data is held in memory as Unicode.

3. Call System.out.println to output the characters. Another encoding conversion happens here.

System.out.println uses the PrintStream class to write character data to the console, and PrintStream encodes characters with the platform default encoding. On our Chinese system the default is GBK, so the Unicode characters in memory are encoded as GBK and handed to the operating system's output facility. Because the operating system is a Chinese system, it also uses GBK when printing characters to the terminal. If at this step the bytes are not GBK-encoded, the terminal shows garbled text.
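To make the output step concrete, the following sketch prints the same string through two PrintStream instances with different explicit encodings and captures the bytes each one produces. It is an illustration under stated assumptions rather than anything from the original article: the class PrintStreamEncoding and the helper printedBytes are hypothetical names, the bytes go to an in-memory buffer instead of a real console, and the JDK is assumed to provide the GBK charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class PrintStreamEncoding {
    /** Prints text through a PrintStream using the named encoding and returns the raw bytes. */
    static byte[] printedBytes(String text, String encoding) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            // PrintStream(OutputStream, autoFlush, encoding) fixes the output encoding explicitly
            PrintStream out = new PrintStream(sink, true, encoding);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("encoding not supported: " + encoding, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u5927\u5bb6\u597d"; // "Hello, everyone" in Chinese, as Unicode escapes
        System.out.println("GBK bytes:   " + printedBytes(text, "GBK").length);
        System.out.println("UTF-8 bytes: " + printedBytes(text, "UTF-8").length);
    }
}
```

The three characters become 6 bytes in GBK and 9 bytes in UTF-8, so a console that expects one encoding cannot make sense of bytes produced in the other.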

So when a Java file containing Chinese characters is run in Eclipse and the console shows garbled text, which step went wrong? Let's analyze it step by step.

-After saving the Java file as UTF-8, reopening it shows no garbled text, so this step is correct.

-Compiling and running the Java file is done by Eclipse itself, so there should be no problem there.

-System.out.println encodes the correct in-memory Unicode characters as GBK and sends them to the Eclipse console. Wait: in the Common tab of the Run Configurations dialog, the console encoding is set to UTF-8! That is the problem. System.out.println has encoded the characters as GBK, yet the console decodes them as UTF-8, so garbled text naturally appears.

Set the console encoding to GBK and the garbled output is gone.

(A supplement: Eclipse's console encoding is inherited from the workspace setting, and the console encoding list usually offers no GBK option and does not accept typed input. We can first set the workspace encoding to GBK, after which the GBK option appears in the console settings; select it there, then change the workspace encoding back to UTF-8.)

Second, Chinese characters hard-coded in a JSP file are displayed garbled in the browser.

We write a JSP page in Eclipse, and when the page is served by Tomcat and viewed in a browser, all the Chinese characters on it are garbled. What causes this?

From being written to being viewed in the browser, a JSP page goes through four character encoding/decoding steps:

1. The JSP file is saved in some character encoding

2. Tomcat reads the JSP file with the specified encoding and compiles it

3. Tomcat sends the HTML content to the browser in the specified encoding

4. The browser parses the HTML content with the specified encoding

If any one of these four steps goes wrong, the final display is garbled. Let's look in turn at how the encoding for each step is set.

-The encoding used to save the JSP file is set in the editor; in Eclipse, for example, set the file's character encoding to UTF-8.

-The JSP file begins with <%@ page language="java" contentType="text/html; charset=utf-8" pageEncoding="utf-8" %>, where pageEncoding tells Tomcat which character encoding the file uses. It should match the encoding Eclipse used to save the file. Tomcat reads and compiles the JSP file with this encoding.

-The contentType attribute of the page directive sets the encoding Tomcat uses to send the HTML content to the browser. This encoding is announced to the browser in the HTTP response header.

-The browser parses the HTML content with the character encoding specified in the HTTP response header, for example:

HTTP/1.1 200 OK

Date: Mon, Sep 23:13:31 GMT

Server: Apache/2.2.4 (Win32) mod_jk/1.2.26

Vary: Host,Accept-Encoding

Set-Cookie: java2000_style_id=1; Domain=www.java2000.net; Expires=Thu, 03-Nov-2011 09:00:10 GMT; Path=/

Content-Encoding: gzip

Transfer-Encoding: chunked

Content-Type: text/html;charset=utf-8
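As a rough illustration of how a receiver might pick the encoding out of such a response, the sketch below extracts the charset parameter from a Content-Type value. It is a simplified, hypothetical helper (the names ContentTypeCharset and charsetOf are invented here), not how any particular browser or library actually implements it, and it ignores the finer points of HTTP header parsing.

```java
import java.util.Locale;

final class ContentTypeCharset {
    /** Returns the charset parameter of a Content-Type value, or null if none is present. */
    static String charsetOf(String contentType) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // parameter names are matched case-insensitively
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // the value may be quoted, e.g. charset="utf-8"
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value;
            }
        }
        return null; // no charset given; the receiver has to fall back to something else
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html;charset=utf-8"));
    }
}
```

When the header carries no charset parameter the helper returns null, which corresponds to the case where the receiver must determine the encoding by other means.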

In addition, the HTML contains the tag <meta http-equiv="Content-Type" content="text/html; charset=utf-8">, which also specifies a charset. However, this encoding only takes effect when the HTTP header does not specify one, for example when the page has been saved locally as a static Web page: there is no HTTP header then, so the browser determines the encoding of the HTML content from this tag.

Nowadays hard-coded text in a JSP file rarely comes out garbled, because everyone uses an editor such as Eclipse, which largely keeps these encoding settings consistent automatically. The more common problem now is garbled text when a JSP file reads Chinese characters from some other data source.

Third, a JSP file reads a text file and displays it in the page, and the Chinese characters are displayed garbled.

For example, we use the following code in a JSP file:

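<%@ page import="java.io.BufferedReader, java.io.FileReader" %>

<%

// FileReader decodes the file with the platform default encoding

BufferedReader reader = new BufferedReader(new FileReader("D:\\test.txt"));

String content = reader.readLine();

reader.close();

%>

<%=content%>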

test.txt contains Chinese characters, but they appear garbled in the browser. This is a frequently encountered problem. Let's again follow the input and output streams step by step, as before:

1. test.txt stores the Chinese characters in some encoding, for example UTF-8.

2. BufferedReader reads the bytes of test.txt and builds the string using the default encoding. Looking at the source, BufferedReader calls FileReader's read method, and FileReader obtains the raw bytes through FileInputStream's native read method (a native method is one implemented by the underlying platform). FileReader then decodes those bytes with the platform default encoding, and since our operating system is a Chinese system, that is GBK. Because test.txt was saved as UTF-8, decoding its contents as GBK is wrong.

3. <%=content%> is really out.print(content), which writes to the HTTP output stream through JspWriter. The string content is encoded into bytes using the UTF-8 encoding specified in the JSP page directive and sent to the browser.

4. The browser decodes the characters in the way specified in the HTTP header. By this point the string is already wrong, so whether the browser decodes as GBK or as UTF-8, the display is garbled.

As we can see, the conversion went wrong in step 2: the UTF-8 text was read into memory as GBK.

There are two ways to solve this problem. One is to save test.txt as GBK, so that the default decoding reads the Chinese characters correctly. The other is to use InputStreamReader to specify the character encoding explicitly, for example:

InputStreamReader sr = new InputStreamReader(new FileInputStream("D:\\test.txt"), "UTF-8");

BufferedReader reader = new BufferedReader(sr);

This way, Java reads the character data from the file as UTF-8.
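A self-contained version of the second approach might look like the sketch below. It is an assumption-laden illustration, not the author's code: the class Utf8FileRead and its methods are hypothetical, a temporary file stands in for the article's test.txt, and try-with-resources is used so the reader is closed even when an error occurs.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8FileRead {
    /** Reads the first line of a file, decoding its bytes explicitly as UTF-8. */
    static String firstLine(Path file) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file.toFile()), StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes text to a temporary UTF-8 file, reads it back, and removes the file. */
    static String writeThenRead(String text) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt"); // stand-in for test.txt
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return firstLine(file);
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u5927\u5bb6\u597d"; // "Hello, everyone" in Chinese, as Unicode escapes
        System.out.println("text survived the file round trip: " + writeThenRead(text).equals(text));
    }
}
```

Because the encoding is named explicitly on the reading side, the result no longer depends on the platform default encoding of the machine the code runs on.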

In addition, we can set the default character encoding the virtual machine uses to read files by passing the -Dfile.encoding option to the java command, for example java -Dfile.encoding=utf-8 Test; System.getProperty("file.encoding") in Java code then returns utf-8.
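A quick way to see which defaults a given JVM is actually running with is to print them. The small class below is only a sketch (the name DefaultCharsetInfo and the method describe are invented for this example); it reports the file.encoding system property alongside Charset.defaultCharset().

```java
import java.nio.charset.Charset;

final class DefaultCharsetInfo {
    /** Describes the encoding-related defaults of the running JVM. */
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once as java DefaultCharsetInfo and once as java -Dfile.encoding=utf-8 DefaultCharsetInfo shows whether the option takes effect on the JDK in use; how the setting is honored can differ between JDK versions, so it is worth checking rather than assuming.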
