Java Chinese character encoding

Source: Internet
Author: User

Garbled Chinese characters in Java have always been a headache, especially in web applications. There are many analyses and solutions on the Internet, but each tends to apply only to a particular situation. After running into garbled text many times, debugging, and searching for material, I would finally solve the problem and think I had mastered these garbled-character monsters. Then, after switching to another application or environment, the annoying "Martian text" would show up again and leave me at a loss once more. So I made up my mind to sort out Chinese character encoding thoroughly, both to help my own memory and to provide a reference for other programmers.

First, you must understand how Java processes characters. Java stores character data in memory as Unicode. Processing characters usually takes three steps:

- Read the character data from the source input stream, decoding it with the specified character encoding.

- Hold the character data in memory as Unicode.

- Encode the character data with the specified character encoding and write it to the destination output stream.

Therefore, Java always performs two encoding conversions when handling characters: one from the specified encoding to Unicode, and one from Unicode to the specified encoding. If characters are decoded with the wrong encoding while being read, memory holds the wrong Unicode characters. Between being read from the original file and finally being displayed on the screen, character data is converted by applications many times. If at any point in between the data read from an input stream is decoded with the wrong encoding, or the characters are written to an output stream with the wrong encoding, then the next receiver of the data decodes it incorrectly, and garbled characters appear.

This is the guiding principle for analyzing and solving character encoding problems.
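The principle can be seen in a few lines of Java. The following is a minimal sketch, not taken from the original article: the class name TranscodeSketch and the helper writeThenRead are made up for illustration, and the sample text (the three Chinese characters used in the examples of this article) is written with Unicode escapes so that the encoding of the source file itself does not matter. It encodes a string to bytes with one charset and decodes the bytes with another, which is what happens at every hop between two programs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TranscodeSketch {

    // Encode text with one charset, then decode the bytes with another,
    // the way a writer and a reader on either side of a stream would.
    static String writeThenRead(String text, Charset writerCharset, Charset readerCharset) {
        byte[] bytesOnTheWire = text.getBytes(writerCharset);
        return new String(bytesOnTheWire, readerCharset);
    }

    public static void main(String[] args) {
        String text = "\u5927\u5bb6\u597d"; // three Chinese characters

        String matched = writeThenRead(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mismatched = writeThenRead(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(matched.equals(text));    // reader and writer agree
        System.out.println(mismatched.equals(text)); // reader uses the wrong charset
        System.out.println(mismatched.length());     // one char per UTF-8 byte
    }
}
```

When the two charsets agree, the text survives; when they differ, memory ends up holding nine wrong characters instead of three correct ones, and nothing downstream can tell.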

Now we can take on these garbled-character monsters.

1. Chinese characters hard-coded in a Java file produce garbled output on the console when run in Eclipse.

For example, we write the following code in a Java file:

String text = "大家好"; // "Hello everyone"
System.out.println(text);

If we compile and run the program in Eclipse, we may see garbled output like this: ????. Why?

Let's take a look at the entire character conversion process.

1. We type the Chinese characters in the Eclipse editor and save the Java file as UTF-8. Several encoding conversions already happen here, but because we trust Eclipse to get this right, we do not need to analyze them; we simply assume the saved Java file really is UTF-8.

2. We compile and run the Java file in Eclipse. The encoding conversions during compilation and at run time need to be analyzed in detail.

- Compile: when we compile a Java file with javac, javac does not try to guess the encoding of the source file, so it has to be told which encoding to read the file with. By default, javac parses Java files using the platform default encoding, which is determined by the operating system. We use a Chinese operating system with the region set to mainland China, so the platform default is usually GBK. The value can be checked in Java with System.getProperty("file.encoding"). javac therefore parses Java files as GBK by default. To change the encoding javac uses, add the -encoding option, for example javac -encoding UTF-8 Test.java.

A side note: Eclipse uses its own built-in compiler, and javac options cannot be passed to it. If you need to pass options to javac, compiling with Ant is recommended. This is not the cause of the garbled characters, however, because Eclipse lets you set the character encoding of each Java file, and the built-in compiler reads the file according to that setting.

- Run: after compilation, the character data is stored in the bytecode file in a Unicode format. Eclipse then invokes the java command to run the bytecode. Because the characters in bytecode are always in a Unicode format, the virtual machine performs no charset conversion when it reads the bytecode file; after the file is loaded, the character data is held in memory as Unicode.

3. System.out.println is called to output the characters, and another encoding conversion takes place.

System.out.println uses the PrintStream class to write character data to the console. PrintStream encodes the characters with the platform default encoding, which on our Chinese system is GBK. The Unicode characters in memory are therefore transcoded to GBK and handed to the operating system's output service. Because our operating system is a Chinese system, the terminal expects GBK-encoded bytes when it prints characters. If the bytes arriving at this step are not GBK, the terminal displays garbled characters.
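What PrintStream does in this step can be made visible by pointing it at a byte buffer instead of the console. This is a minimal sketch: the class and method names (PrintStreamSketch, bytesPrinted, hex) are invented for illustration, and the encoding is passed explicitly here, whereas System.out takes it from the platform.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

final class PrintStreamSketch {

    // Returns the bytes a PrintStream emits for the text
    // when it encodes with the named charset.
    static byte[] bytesPrinted(String text, String encodingName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(buffer, true, encodingName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Charset not available: " + encodingName, e);
        }
        return buffer.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String text = "\u5927\u5bb6\u597d"; // three Chinese characters
        System.out.println("UTF-8: " + hex(bytesPrinted(text, "UTF-8")));
        // GBK ships with most full JDK installations but is not guaranteed everywhere.
        if (Charset.isSupported("GBK")) {
            System.out.println("GBK:   " + hex(bytesPrinted(text, "GBK")));
        }
    }
}
```

The same three characters come out as different byte sequences under the two encodings, so a console that assumes one encoding while PrintStream writes the other cannot display them correctly.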

So when a Java file containing Chinese characters is run in Eclipse and garbled characters appear on the console, which conversion went wrong? Let's analyze it step by step.

- The Java file is saved as UTF-8. If you reopen it and see no garbled characters, this step is correct.

- Compiling and running the Java file with Eclipse itself should be fine.

- System.out.println encodes the correct Unicode characters in memory as GBK and sends them to the Eclipse console. But wait: on the Common tab of the Run Configurations dialog, the console character encoding is set to UTF-8! That is the problem. System.out.println has encoded the characters as GBK, while the console still reads them as UTF-8, so garbled characters naturally appear.

Setting the console character encoding to GBK solves the garbled output.

(A further note: the Eclipse console encoding is inherited from the workspace setting. The console encoding list usually has no GBK option and does not accept typed input. You can first type GBK into the workspace encoding setting, after which GBK appears as an option in the console setting; once the console is set, change the workspace encoding back to UTF-8.)

2. Chinese characters hard-coded in a JSP file are displayed as garbled characters in the browser.

We write a JSP page in Eclipse. When the page is served by Tomcat and viewed in a browser, the Chinese characters on it are garbled. Why?

From writing the page to viewing it in a browser, a JSP page goes through four encoding/decoding steps:

1. The JSP file is saved in some character encoding.

2. Tomcat reads the JSP file with a specified encoding and compiles it.

3. Tomcat sends the HTML content to the browser in a specified encoding.

4. The browser parses the HTML content with a specified encoding.

If any one of these steps goes wrong, garbled characters are displayed. Let's look at how the encoding for each step is set, in order.

- The encoding used to save the JSP file is set in the editor. For example, in Eclipse, set the file's character encoding to UTF-8.

- In the directive <%@ page language="java" contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %> at the top of the JSP file, pageEncoding tells Tomcat which character encoding the file uses. It should match the encoding used to save the file in Eclipse. Tomcat reads and compiles the JSP file with this encoding.

- The contentType attribute of the page directive sets the encoding Tomcat uses when sending the HTML content to the browser. This encoding is also stated in the HTTP response header to inform the browser.

- The browser parses the HTML content according to the character encoding given in the HTTP response header. For example:

HTTP/1.1 200 OK
Date: Mon, 01 Sep 2008 23:13:31 GMT
Server: Apache/2.2.4 (Win32) mod_jk/1.2.26
Vary: Host,Accept-Encoding
Set-Cookie: java2000_style_id=1; domain=www.java2000.net; expires=Thu, 03-Nov-2011 09:00:10 GMT; path=/
Content-Encoding: gzip
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

In addition, a charset is also specified in the HTML tag <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">. However, this setting only takes effect when the HTTP response header does not specify a charset, for example when the page has been saved locally as a static file: because there is no HTTP header, the browser uses this tag to identify the encoding of the HTML content.
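The lookup a client performs on the Content-Type header can be sketched as follows. This is only an illustration of the idea, not the parsing code of any particular browser; the class ContentTypeSketch and the method charsetOf are hypothetical names, and real header parsing handles more edge cases than this.

```java
import java.util.Optional;

final class ContentTypeSketch {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=UTF-8". Returns empty if no charset is given.
    static Optional<String> charsetOf(String contentType) {
        String[] parts = contentType.split(";");
        // parts[0] is the media type; the parameters follow it.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq <= 0) {
                continue;
            }
            String name = param.substring(0, eq).trim();
            String value = param.substring(eq + 1).trim();
            if (name.equalsIgnoreCase("charset")) {
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? Optional.<String>empty() : Optional.of(value);
            }
        }
        return Optional.<String>empty();
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8"));
        // No charset in the header: the client has to fall back to the meta tag or a guess.
        System.out.println(charsetOf("text/html"));
    }
}
```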

Nowadays hard-coded text in JSP files rarely ends up garbled, because editors such as Eclipse largely keep these encoding settings consistent automatically. It is now more common to run into garbled characters when a JSP reads Chinese text from some other data source.

3. A JSP reads a text file and displays it on the page, and the Chinese characters are garbled.

For example, we use the following code in the JSP file:

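<%@ page import="java.io.*" %>
<%
// FileReader decodes with the platform default encoding (GBK in this scenario).
BufferedReader reader = new BufferedReader(new FileReader("D:\\test.txt"));
String content = reader.readLine();
reader.close();
%>
<%= content %>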

test.txt contains Chinese characters, but garbled characters are displayed in the browser. This is a common problem. We keep using the same method and analyze the input and output streams step by step.

1. test.txt stores the Chinese characters in some encoding, for example UTF-8.

2. BufferedReader reads the bytes of test.txt and builds strings using the default decoding. Looking at the code, BufferedReader calls the read method of FileReader, and FileReader reads raw bytes through the native read method of FileInputStream and decodes them with the platform default encoding. Because the operating system is a Chinese system, the file is decoded as GBK by default. Since test.txt was saved as UTF-8, decoding its content as GBK here is wrong.

3. <%= content %> is effectively out.print(content). This uses the HTTP output stream, JspWriter, so the string content is encoded into bytes using the UTF-8 charset specified in the JSP page directive and sent to the browser.

4. The browser decodes the characters in the way specified in the HTTP header. Whether it decodes with GBK or with UTF-8, the display is garbled.

So the encoding conversion in step 2 is where the error occurs: a UTF-8 string is read into memory as GBK.

There are two ways to solve this problem. One is to save test.txt as GBK, so that the default decoding reads the Chinese characters correctly. The other is to use InputStreamReader to specify the character encoding, for example:

InputStreamReader sr = new InputStreamReader(new FileInputStream("D:\\test.txt"), "UTF-8");
BufferedReader reader = new BufferedReader(sr);

In this way, Java uses UTF-8 to read character data from the file.
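The difference between the two ways of reading can be tried outside a JSP. The sketch below is self-contained and makes a few assumptions of its own: it writes a temporary file instead of using D:\test.txt, the class and helper names (ReadWithCharsetSketch, writeTemp, readFirstLine) are invented, and ISO-8859-1 stands in for the wrong encoding when GBK is not installed in the runtime.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ReadWithCharsetSketch {

    // Writes the text to a temporary file using the given charset.
    static Path writeTemp(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            Files.write(file, text.getBytes(charset));
            file.toFile().deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the first line, decoding the bytes with an explicit charset.
    static String readFirstLine(Path file, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u5927\u5bb6\u597d"; // three Chinese characters
        Path file = writeTemp(text, StandardCharsets.UTF_8);

        Charset wrong = Charset.isSupported("GBK")
                ? Charset.forName("GBK") : StandardCharsets.ISO_8859_1;

        System.out.println(readFirstLine(file, StandardCharsets.UTF_8).equals(text)); // matching charset
        System.out.println(readFirstLine(file, wrong).equals(text));                  // mismatched charset
    }
}
```

Passing the charset explicitly removes the dependency on whatever the platform default happens to be, which is the point of the InputStreamReader fix.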

In addition, we can add the -Dfile.encoding option to the java command to specify the default character encoding the VM uses to read files, for example java -Dfile.encoding=UTF-8 Test. With this option, System.getProperty("file.encoding") in Java code returns UTF-8.
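To see which default a particular VM is actually running with, a small check such as the following can be used. It is a sketch with invented names (DefaultEncodingSketch, describeDefaults); note that on recent JDKs (18 and later) the default is UTF-8 regardless of the operating system locale, so the GBK default described in this article applies to older Java versions.

```java
import java.nio.charset.Charset;

final class DefaultEncodingSketch {

    // Reports the file.encoding system property and the charset
    // the VM uses when no encoding is given explicitly.
    static String describeDefaults() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // Try running with and without -Dfile.encoding=UTF-8 and compare.
        System.out.println(describeDefaults());
    }
}
```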

4. A JSP reads a Chinese parameter with request.getParameter, and garbled characters are displayed on the page.

In Java web applications, handling Chinese text in request parameters has always been the most common and the hardest monster to beat. Often one case has just been fixed when garbled text appears somewhere else. The main reason for this complexity is that many encoding and decoding steps are involved, and neither browsers nor web servers, Tomcat in particular, give us satisfactory support.

First, let's analyze garbled parameters submitted with GET.

For example, enter the following URL in the browser's address bar: http://localhost:8080/test.jsp?param=大家好 ("Hello everyone")

Our JSP code processes the param parameter as follows:

<% String text = request.getParameter("param"); %>
<%= text %>

With just these two lines of code, we are likely to see garbled text like this on the page: ? Ó ????

There are many articles and methods on the Internet for handling garbled text from request.getParameter, and they are all correct, but there are so many methods that people never quite understand why they work. Here we analyze what is actually going on.

First, let's take a look at the encoding settings related to the request object:

1. The character encoding of the JSP file

2. The character encoding of the source page that requests the URL with parameters

3. The "send URLs as UTF-8" option in IE's advanced settings

4. The URIEncoding setting in Tomcat's server.xml

5. The request.setCharacterEncoding() function

6. The JavaScript encodeURIComponent function and the Java URLDecoder class

It is no wonder people get dizzy with so many related encoding settings. Here we analyze the various situations. See the table below:

Except for the entries marked IE7, all results in the table above were obtained on IE6.

From this table we can see that IE's "send URLs as UTF-8" setting does not affect how the parameter is parsed, and that a URL requested from a page behaves differently from a URL typed into the address bar.

Starting from the phenomena listed in the table, capturing a few network packets with SmartSniff and looking a little into the Tomcat source code leads to the following conclusions:

1. The "send URLs as UTF-8" option in IE's settings only applies to the path part of the URL and not to the query string. That is, with this option selected, in a URL such as http://localhost:8080/test/大家好.jsp?param=大家好, the first "大家好" is converted to UTF-8 and the second is left unchanged. UTF-8 here means UTF-8 plus percent-escaping, that is, %E5%A4%A7%E5%AE%B6%E5%A5%BD.

So which encoding is used to send the Chinese characters in the query string to the server? The answer is the system default encoding, that is, GBK (in which "大家好" is sent as %B4%F3%BC%D2%BA%C3). In other words, on our Chinese operating system, the query string of a URL entered in the address bar is always sent to the web server encoded as GBK.

2. When a URL is requested from a page, through a link, a location redirect, or opening a new window, which encoding is used for the Chinese characters in the URL? Answer: the encoding of that page. That is, if we reach the URL http://localhost:8080/test.jsp?param=大家好 through a link on a source JSP page, and the source page is encoded as UTF-8, then the characters "大家好" are encoded as UTF-8.

When a URL is typed directly into the address bar, or pasted into it from the system clipboard, the input comes not from a page but from the operating system, so its encoding is simply the system default and has nothing to do with any page. We also found that for a page opened through a link, pressing Enter in the address bar gives different results in different browsers. In IE the result does not change after pressing Enter, while in Maxthon (Aoyou) it may change from correct to garbled or from garbled to correct. When you press Enter in IE, the URL actually sent is the one it remembered in memory from before, whereas Maxthon re-reads the URL from the current address bar and sends that.

3. If Tomcat's URIEncoding is not set, ISO-8859-1 is used to decode the URL by default (in the Tomcat versions of that time; Tomcat 8 and later default to UTF-8); otherwise the configured encoding is used. This decoding covers both the path part and the query string. This makes URIEncoding the most critical setting for Chinese parameters passed with GET. However, it only applies to parameters passed with GET and has no effect on POST. The Tomcat source code shows that when a page is requested, Tomcat constructs a request object, reads the URIEncoding value from server.xml, and assigns it to the queryStringEncoding variable of the Parameters class. This variable then governs character decoding when request.getParameter returns GET parameters.

4. The request.setCharacterEncoding function only applies to POST parameters and has no effect on GET parameters. It must be called before the first call to request.getParameter. The reason is that the Parameters class has two character encoding fields, encoding and queryStringEncoding; setCharacterEncoding sets encoding, which is used only when POST parameters are parsed.

Therefore, POST and GET encodings usually have to be configured separately: the filter that ships with Tomcat can only handle POST, and URIEncoding has to be set for GET. This is troublesome, and URIEncoding cannot choose the encoding dynamically based on the content, so it remains a nagging problem.

While investigating the Tomcat code, I found another server.xml parameter, useBodyEncodingForURI, that solves this problem. If it is set to true, Tomcat parses GET parameters with the character encoding set by request.setCharacterEncoding. This way the SetCharacterEncodingFilter can handle both GET and POST parameters.
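The effect of the URI decoding charset on a GET parameter can be modelled with the JDK's own URLEncoder and URLDecoder. This is only a model under stated assumptions: Tomcat uses its own decoder rather than URLDecoder, and the class and method names below (QueryDecodingSketch, sentByBrowser, decodedByServer) are invented for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class QueryDecodingSketch {

    // Models the percent-encoded parameter value a page in the given encoding sends.
    static String sentByBrowser(String value, String pageEncoding) {
        try {
            return URLEncoder.encode(value, pageEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Charset not available: " + pageEncoding, e);
        }
    }

    // Models the string the server produces when it decodes
    // the query string with the given charset.
    static String decodedByServer(String rawValue, String uriEncoding) {
        try {
            return URLDecoder.decode(rawValue, uriEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Charset not available: " + uriEncoding, e);
        }
    }

    public static void main(String[] args) {
        String text = "\u5927\u5bb6\u597d"; // three Chinese characters
        String raw = sentByBrowser(text, "UTF-8");
        System.out.println(raw);

        // Decoding charset left at ISO-8859-1: the parameter is wrong in memory.
        System.out.println(decodedByServer(raw, "ISO-8859-1").equals(text));
        // Decoding charset matches the page encoding: the parameter is correct.
        System.out.println(decodedByServer(raw, "UTF-8").equals(text));
    }
}
```

The bytes on the wire are identical in both cases; only the charset chosen for decoding decides whether the string that reaches the application is correct.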

With this knowledge, let's analyze a few typical cases from the table above.

Row 1: the source page of the request is encoded as UTF-8, and Tomcat's URIEncoding is not specified, so Tomcat decodes the parameter as ISO-8859-1. The Unicode data held in memory after reading from the request is therefore wrong, and every later conversion on the way to the screen is wrong as well.

Row 9: the source page of the request is encoded as GBK, and Tomcat's URIEncoding is also GBK. Tomcat decodes the GBK-encoded characters as GBK, so the decoding is correct, the Unicode values in memory are correct, and the correct Chinese characters are displayed.

Row 13: the source page of the request is encoded as UTF-8 and Tomcat's URIEncoding is also UTF-8, yet in IE6, if the number of Chinese characters is odd, the last one is displayed garbled. Why?

My guess is that when IE6 sends the URL, it encodes the query string as GBK directly from the UTF-8 bytes rather than from the Unicode characters, so the UTF-8 data is turned into GBK without going through Unicode. On the Tomcat side, this GBK-encoded data is then decoded as UTF-8. The data thus goes from UTF-8 to GBK and is then read back as UTF-8, and in this conversion the last character of a string with an odd number of Chinese characters becomes garbled. In IE7 this was presumably fixed as a bug, so that the address is first converted to Unicode and then encoded as GBK. Presumably, with IE7 on a Chinese operating system, if we set Tomcat's URIEncoding to GBK, there will be no garbled characters regardless of the JSP's encoding. I have not tested this; please verify it yourself.

The other rows are not analyzed here. If you are interested, you can work through them yourself.

5. Encoding and decoding the URL

I personally think URL encoding and decoding is the best way to handle garbled Chinese characters in request parameters, because if your website needs to support internationalization, it is best to make sure the parameters sent from IE are always correct UTF-8.

On the IE side, we can encode the parameter in JavaScript with encodeURIComponent(); after encoding, the Chinese characters "大家好" become %E5%A4%A7%E5%AE%B6%E5%A5%BD. On the Java side, java.net.URLDecoder.decode can be used to decode. Note, however, that Tomcat automatically decodes the URL once first. This can be seen in Tomcat's UDecoder class; Tomcat does not use URLDecoder.decode but implements its own decode function. Some articles on the Internet describe a way to handle garbled text in which the parameter is passed through encodeURIComponent twice in JavaScript and decoded once in Java, which works around some garbled-text problems when URIEncoding is not set. However, I personally think that once you understand the whole encoding conversion process, this method is essentially unnecessary.
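Why the double-encoding trick works can be followed step by step in Java. The sketch below uses URLEncoder as a stand-in for the JavaScript encodeURIComponent (the two differ in details, for example URLEncoder turns a space into +), and the names UrlRoundTripSketch, encodeUtf8 and decode are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class UrlRoundTripSketch {

    // Percent-encodes the UTF-8 bytes of the value.
    static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Percent-decodes the value, interpreting the bytes in the named charset.
    static String decode(String value, String charsetName) {
        try {
            return URLDecoder.decode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Charset not available: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String text = "\u5927\u5bb6\u597d"; // three Chinese characters

        String once = encodeUtf8(text);
        String twice = encodeUtf8(once); // every % becomes %25, so the result is pure ASCII
        System.out.println(once);
        System.out.println(twice);

        // First decode (done by the container): harmless in any charset,
        // because only ASCII comes out.
        String afterContainer = decode(twice, "ISO-8859-1");
        // Second decode (done by the application) with the charset it knows was used.
        System.out.println(decode(afterContainer, "UTF-8").equals(text));
    }
}
```

After the second encoding only ASCII is on the wire, so the container's first decode cannot damage it regardless of its configured charset; the price is that every consumer of the parameter has to remember the extra decode.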

6. Chinese character data read from a database is displayed garbled on the page.

I have rarely run into garbled Chinese characters read from a database, so I have not summarized this case yet. If you have relevant experience, you are welcome to add it; please be sure to credit the author.

That concludes the analysis of the various garbled-character problems. I believe that as long as we keep hold of the basic steps "read with the specified encoding, convert to Unicode, write out with the specified encoding", even beginners can quickly find the root cause of garbled characters. In addition, I recommend not using new String(str.getBytes(enc1), enc2) to force a transcoding, and not using the character transcoding functions found online. I think they only bury the problem deeper and make it more complex. We should analyze clearly the encoding and decoding at every step of the character stream; the root cause of the garbled characters then becomes apparent, and we can make sure the Unicode in memory is correct throughout the whole flow.
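One way to see why the forced transcoding is fragile: it only undoes a wrong decode when that decode happened to be lossless. The following sketch, with invented names (ForcedTranscodeSketch, misread, forceTranscode), compares a wrong decode through ISO-8859-1, which keeps every byte, with one through US-ASCII, which replaces every byte it cannot map.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ForcedTranscodeSketch {

    // Simulates an earlier step that decoded the bytes with the wrong charset.
    static String misread(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }

    // The discouraged idiom: new String(str.getBytes(enc1), enc2).
    static String forceTranscode(String str, Charset enc1, Charset enc2) {
        return new String(str.getBytes(enc1), enc2);
    }

    public static void main(String[] args) {
        String text = "\u5927\u5bb6\u597d"; // three Chinese characters

        // Wrong decode through ISO-8859-1 keeps every byte, so it can be undone.
        String viaLatin1 = misread(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(forceTranscode(viaLatin1,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8).equals(text));

        // Wrong decode through US-ASCII replaces the unmappable bytes,
        // so the original data is gone.
        String viaAscii = misread(text, StandardCharsets.UTF_8, StandardCharsets.US_ASCII);
        System.out.println(forceTranscode(viaAscii,
                StandardCharsets.US_ASCII, StandardCharsets.UTF_8).equals(text));
    }
}
```

Whether the trick appears to work therefore depends on which wrong charset was involved earlier in the chain, which is exactly the kind of hidden dependency that makes the problem harder to find later.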
