Java character set encoding: the garbled text problem

Last Update:2015-01-08 Source: Internet

Author: User

Blog Category: Java, JSP, Servlet

I recently ran into garbled Chinese characters while working on a web page. It looks like a small problem, but it took me a long time to sort out. Here is my analysis. (To be honest, some of this comes from the Internet, but I worked through all of it myself, which both helped me understand it and made it stick.)

Internally, Java handles all strings as Unicode (exposed as UTF-16 code units). But which character set is a string in before Java converts it? Unless told otherwise, Java relies on the operating system's default encoding to decode incoming bytes, and Java's input and output use that default as well. So if the input and output of the Java system agree with the operating system's character set, Chinese characters are processed and displayed correctly. This is the basic principle for handling characters in a Java system.
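
The difference between the in-memory string and its encoded bytes can be seen in a minimal sketch like the one below. It assumes a standard JDK; the class name `DefaultCharsetDemo` and the helper `encodedLength` are invented for illustration, and the sample text is written with Unicode escapes so that the source file itself stays ASCII-only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {

    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Two Chinese characters (U+4E2D, U+6587), written as escapes.
        String text = "\u4e2d\u6587";

        // Inside the JVM the string is a sequence of UTF-16 code units.
        System.out.println("chars in memory:  " + text.length());

        // getBytes() with no argument uses the default charset of this JVM.
        System.out.println("default charset:  " + Charset.defaultCharset());
        System.out.println("bytes (default):  " + text.getBytes().length);

        // Naming the charset gives the same result on every machine.
        System.out.println("bytes (UTF-8):    " + encodedLength(text, StandardCharsets.UTF_8));
        System.out.println("bytes (UTF-16BE): " + encodedLength(text, StandardCharsets.UTF_16BE));
    }
}
```

The "default" lines are the ones that vary between machines. On JDK 18 and later the default charset is UTF-8 unless configured otherwise; on older JDKs it follows the operating system locale, which is the situation this article describes.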

In real projects, however, it is hard to keep every input and output of a Java system under control. In Java EE the problem is especially visible because browsers, databases, and other external systems are involved. A Java EE application runs inside a Java EE container and has several input paths. The first is a request sent to the server from a page form. The second is data read from the database. The third is more subtle: a JSP is compiled into a servlet the first time it runs, and when the source is compiled with javac, the source bytes are decoded using the operating system's default encoding unless you specify one explicitly.

There are also several output paths in Java EE. The first is JSP page output: since the JSP has been compiled into a servlet, the output encoding also follows the operating system's default unless one is specified. The second is writing strings to the database. The inputs and outputs of a Java EE system are therefore numerous and dynamic. Java is cross-platform, so compilation and execution may happen on different operating systems; if Java is left free to take its input and output encodings from whichever operating system it runs on, the result is uncontrollable garbling.

The fundamental fix is to specify one character set explicitly for the entire application. Should it be ISO-8859-1, GBK, or UTF-8? Consider each in turn:

1. ISO-8859-1. Much existing software was written by Western developers whose default character set is ISO-8859-1, including Linux and the MySQL database. With this choice we only need to make sure ISO-8859-1 is the character set in three places: the JSP header declaration, the default encoding of the operating system the application runs on, and the encoding used when the code is developed and compiled.

2. GBK. The same three places must be set consistently, and the application can then run only on operating systems whose default encoding is GBK, such as Chinese-language Windows. Standardizing on ISO-8859-1 or GBK can make development convenient, but the application then works only on a matching operating system, which gives up Java's cross-platform advantage and makes sense only within a limited scope. For example, to run a GBK-encoded application on Linux, the Linux encoding must be set to GBK.

3. UTF-8. If the whole Java/J2EE system standardizes on UTF-8, no additional Chinese encoding settings are needed outside the application itself. UTF-8 can encode every language; the only trouble is finding all the entry points into the application and making each one use UTF-8.

A Java EE application needs to do the following:
1. When developing and compiling code, set the character set to UTF-8. In JBuilder (a visual Java development tool) and Eclipse this can be set in the project properties.

2. Use a filter. If every request passes through one controller servlet, that servlet can do the conversion; otherwise a filter converts every request from the browser to UTF-8. This is needed because requests may arrive in various encodings, depending on the browser and its operating system. The filter calls request.setCharacterEncoding("UTF-8"); on each request, and it must be registered in web.xml to take effect. Also declare UTF-8 in the JSP page directive and make the database connection use UTF-8, for example with the MySQL connection URL jdbc:mysql://localhost:3306/test?useUnicode=true&characterEncoding=utf-8

Now let's look at where Java's garbled Chinese problems come from.
Java's core and its class files are Unicode-based, which makes Java programs highly portable but also brings some garbling problems with Chinese. There are two main causes: garbling introduced when Java and JSP source files are compiled, and garbling introduced when a Java program exchanges data with other media.

First, Java (including JSP) source files may well contain Chinese, and source files are stored as bytes. If the encoding used when compiling them into class files differs from the encoding the files were saved in, the text is garbled. To avoid this, try not to write Chinese in Java source files; if you must, compile with an explicit -encoding GBK or -encoding gb2312 option. For JSPs, adding <%@ page contentType="text/html;charset=GBK" %> or <%@ page contentType="text/html;charset=gb2312" %> at the top of the file basically solves this kind of garbling.
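
One way to keep Chinese out of a Java source file while still using Chinese text is to write the characters as Unicode escapes. This is offered as an assumption about how the advice could be applied, not something the original article spells out. The compiler resolves the escapes the same way whatever -encoding is in effect, because the file contains only ASCII bytes. The class name `SourceEscapeDemo` is made up for the sketch.

```java
public class SourceEscapeDemo {

    // ASCII-only source: the escapes denote the code points U+4E2D and U+6587.
    static final String CHINESE = "\u4e2d\u6587";

    public static void main(String[] args) {
        System.out.println("length: " + CHINESE.length());
        System.out.printf("first code point:  U+%04X%n", CHINESE.codePointAt(0));
        System.out.printf("second code point: U+%04X%n", CHINESE.codePointAt(1));
    }
}
```

The trade-off is readability: escaped text is hard to read and edit, so for larger amounts of text an explicit -encoding option is usually the more practical choice.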

The rest of this article focuses on the second kind: garbling that arises when a Java program exchanges data with other storage media. Many media, such as databases, files, and streams, are byte-based, so a conversion between characters (char) and bytes (byte) takes place whenever a Java program interacts with them:
Page form submission ----> Java program (byte ----> char)
Java program ----> page display (char ----> byte)

Database ----> Java program (byte ----> char)
Java program ----> database (char ----> byte)

File ----> Java program (byte ----> char)
Java program ----> file (char ----> byte)

Stream ----> Java program (byte ----> char)
Java program ----> stream (char ----> byte)

If the encoding used in any of these conversions differs from the encoding the bytes were originally written in, garbled characters are likely.

Solution: make sure the encoding used in each conversion matches the original encoding of the bytes.
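
The effect of a mismatch can be reproduced without any web server or database. The following is a small sketch under stated assumptions: the class `MismatchDemo` and its `transcode` helper are hypothetical names, and ISO-8859-1 stands in for "the wrong encoding" because it ships with every JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MismatchDemo {

    // Encode the text with one charset, then decode the bytes with another.
    static String transcode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters

        // Same charset in both directions: the text survives.
        String ok = transcode(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("matching charsets, equal: " + ok.equals(original));

        // UTF-8 bytes decoded as ISO-8859-1: one unrelated character per byte.
        String garbled = transcode(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("mismatched decode, length: " + garbled.length());

        // ISO-8859-1 cannot represent these characters, so encoding
        // substitutes a question mark for each one and the data is lost.
        String lost = transcode(original, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);
        System.out.println("lossy encode: " + lost);
    }
}
```

The two failure modes differ in an important way. A wrong decode keeps the original bytes recoverable, while a wrong encode destroys them, which is why question marks stored in a database usually cannot be repaired afterwards.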

1. Garbling between a JSP and page parameters
A JSP normally reads page parameters using a default encoding. If the parameters were encoded differently, they are likely to be garbled. The fix is to set the request encoding before the first parameter is read: request.setCharacterEncoding("GBK"). If text is garbled when the JSP writes variables to the page, set response.setContentType("text/html;charset=GBK"). If you do not want to write these two statements in every file, a neater approach is to set the encoding in a filter, as defined by the servlet specification. A typical web.xml configuration and the main filter code are as follows:

web.xml:
<filter>
    <filter-name>CharacterEncodingFilter</filter-name>
    <filter-class>travel.web.CharacterEncodingFilter</filter-class>
    <init-param>
        <param-name>encoding</param-name>
        <param-value>GBK</param-value>
    </init-param>
</filter>
<filter-mapping>
    <filter-name>CharacterEncodingFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>
The code in the filter:
CharacterEncodingFilter.java:
package travel.web;

import java.io.IOException;
import javax.servlet.*;

public class CharacterEncodingFilter implements Filter {
    protected String encoding = null;
    public void init(FilterConfig filterConfig) throws ServletException {
        // Read the target encoding from the <init-param> in web.xml.
        this.encoding = filterConfig.getInitParameter("encoding");
    }
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        // Must run before any parameter is read from the request.
        request.setCharacterEncoding(encoding);
        response.setContentType("text/html;charset=" + encoding);
        chain.doFilter(request, response);
    }
    public void destroy() { }
}

2. Garbling between Java and the database
Most databases support Unicode, so the sensible way to avoid garbling between Java and a database is to exchange data in a Unicode encoding directly. Many database drivers support Unicode automatically, such as the Microsoft SQL Server drivers. Most others let you specify the encoding in the driver URL parameters, for example with the MySQL driver: jdbc:mysql://localhost/WEBCLDB?useUnicode=true&characterEncoding=GBK.
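
The same two MySQL settings can also be passed as connection properties instead of URL parameters, which some find easier to keep consistent across environments. The sketch below is illustrative only: the class `JdbcEncodingDemo`, the user name, and the password are hypothetical, and the `open` method will only work with a MySQL JDBC driver on the classpath and a reachable server, so `main` does not call it.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Properties;

public class JdbcEncodingDemo {

    // Example URL without the encoding parameters; host and database are placeholders.
    static final String URL = "jdbc:mysql://localhost:3306/test";

    // Carries useUnicode and characterEncoding as properties rather than in the URL.
    static Properties encodingProperties(String user, String password, String encoding) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        props.setProperty("useUnicode", "true");
        props.setProperty("characterEncoding", encoding);
        return props;
    }

    // Needs a MySQL JDBC driver on the classpath and a running server.
    static Connection open(String user, String password) throws SQLException {
        return DriverManager.getConnection(URL, encodingProperties(user, password, "UTF-8"));
    }

    public static void main(String[] args) {
        // Only builds and prints the settings; no connection is attempted.
        Properties props = encodingProperties("demo", "secret", "UTF-8");
        System.out.println("useUnicode = " + props.getProperty("useUnicode"));
        System.out.println("characterEncoding = " + props.getProperty("characterEncoding"));
    }
}
```

Whichever form is used, the encoding named here should match the one chosen for the rest of the application.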

3. Garbling between Java and files or streams
The classes most often used to read and write files in Java are FileInputStream/FileOutputStream and FileReader/FileWriter. FileInputStream and FileOutputStream are byte-based and are typically used for binary files. For text files, the character-based FileReader and FileWriter are recommended, since they perform the byte/character conversion for you. However, their basic constructors use the system's default encoding, so text may be garbled if the file's encoding differs from it. In that case, use the parent classes of FileReader and FileWriter, InputStreamReader and OutputStreamWriter. They are also character-based, but let you specify the encoding in the constructor: InputStreamReader(InputStream in, Charset cs) and OutputStreamWriter(OutputStream out, Charset cs).
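
As a rough illustration of those two constructors, the sketch below writes text to a temporary file with an explicit charset and reads it back with the same one. The class `FileEncodingDemo` and its helper methods are invented names; only the JDK classes mentioned above are taken from the text.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileEncodingDemo {

    // Characters are converted to bytes using the charset given here.
    static void write(Path file, String text, Charset charset) throws IOException {
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file.toFile()), charset))) {
            out.write(text);
        }
    }

    // Bytes are converted back to characters using the charset given here.
    static String read(Path file, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file.toFile()), charset))) {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // Writes to a temporary file and reports whether the text came back unchanged.
    static boolean roundTrips(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                write(file, text, charset);
                return text.equals(read(file, charset));
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters
        System.out.println("UTF-8 round trip ok: " + roundTrips(text, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1 round trip ok: " + roundTrips(text, StandardCharsets.ISO_8859_1));
    }
}
```

The second call is there to show that merely naming a charset is not enough; it has to be one that can represent the text and that matches on both the writing and the reading side.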

4. Other cases
The methods above should resolve most garbling problems. If garbled text still appears elsewhere, you may need to fix it by hand in code. The key to every Java garbling problem is the same: whenever bytes and characters are converted, you must know the encoding of the original bytes (or the encoding the output bytes should have), and the conversion must use exactly that encoding.
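
One common form of such a manual fix, shown here as a sketch rather than as the article's own method, is to undo a wrong decode: turn the garbled string back into its original bytes with the charset that was wrongly applied, then decode those bytes with the correct one. The class `RepairDemo` and the method `redecode` are hypothetical names, and the scenario (UTF-8 bytes decoded as ISO-8859-1) is an assumed example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RepairDemo {

    // Recover the original bytes, then decode them with the right charset.
    // This is only lossless when the wrongly used charset maps every byte
    // value to a character, as ISO-8859-1 does.
    static String redecode(String garbled, Charset wronglyUsed, Charset actual) {
        return new String(garbled.getBytes(wronglyUsed), actual);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters

        // Simulate input whose UTF-8 bytes were decoded as ISO-8859-1.
        String garbled = new String(
                original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);

        String repaired = redecode(garbled, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("repaired equals original: " + repaired.equals(original));
    }
}
```

This kind of repair treats the symptom. Where possible it is better to set the correct encoding at the point where the bytes are first decoded, as described in the earlier sections.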

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
