JSP Chinese problem solving

Last Update:2018-12-05 Source: Internet

Author: User


Solve Chinese garbled characters


I recently saw many people on this site asking about Chinese character problems, so I have collected the solutions below. Note: this material is not original; it was gathered from various sources.
1. Source of Chinese problems
Early operating systems supported only single-byte character encodings, so all programs were originally written to process English text in a single-byte encoding. As computers spread, the need to support the languages of other nations (including, of course, our Chinese characters) led to Unicode, a double-byte encoding that is compatible with English and with the double-byte characters of other languages. Most international software therefore uses Unicode internally. When it runs, it obtains the default encoding of the local system (usually the operating system) and converts its internal Unicode data to that encoding. The same is true of Java's JDK and JVM. JDK here means the international version, which most programmers use; all later references to the JDK mean the international version. Chinese is a double-byte language. To let computers process Chinese characters, standards such as GB2312, GBK, and GBK2K were developed, and most operating systems ship a localized Chinese version that uses GBK or GB2312 to display Chinese correctly. For example, Chinese Win2k displays text in GBK by default and also saves files in GBK by default. Note: GBK is an extension of GB2312.
Java uses Unicode internally, so a running Java program must convert between Unicode and the encodings supported by the operating system and the browser. This conversion involves a series of steps, and if any one of them goes wrong, Chinese characters are displayed as garbage. This is the common Java Chinese problem.
Java is also cross-platform: the programs we write must run not only on Chinese Windows but also on Chinese Linux and on English-language systems (people often port Java programs written on Chinese Win2k to English Linux). Such porting can also cause Chinese problems.
In addition, some people run programs containing Chinese characters, or browse Chinese web pages, on English operating systems and in English versions of browsers such as Internet Explorer. These environments do not support Chinese by default, which can also cause Chinese problems.
Finally, almost all browsers pass parameters in UTF-8 by default rather than in a Chinese encoding, so Chinese parameters can also arrive garbled.
In short, these are the main sources of Chinese problems in Java. We call any failure of a program to run correctly for the reasons above a Java Chinese problem.
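To make the discussion concrete, here is a minimal sketch (not part of the original article) that prints the bytes the same two Chinese characters occupy in different encodings. The class name EncodingSizes and its helper methods are hypothetical, and the example assumes a JDK that ships the GBK charset.

```java
import java.nio.charset.Charset;

public class EncodingSizes {
    // The two characters U+4E2D U+6587, written as escapes so that the
    // example does not depend on the encoding of this source file.
    static final String TEXT = "\u4e2d\u6587";

    /** Formats bytes as upper-case hex pairs separated by spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes the text with the named charset and returns the bytes as hex. */
    static String encode(String text, String charsetName) {
        return hex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        System.out.println("GBK:      " + encode(TEXT, "GBK"));
        System.out.println("UTF-8:    " + encode(TEXT, "UTF-8"));
        System.out.println("UTF-16BE: " + encode(TEXT, "UTF-16BE"));
    }
}
```

The same string produces three different byte sequences, so any step that reads bytes with a different encoding from the one used to write them will produce garbled text.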
2. Detailed process of Java encoding conversion
Common Java programs include:
* Classes that run directly on the console (including GUI classes)
* JSP code (note: a JSP is a variant of a servlet class)
* Servlet classes
* EJB classes
* Other support classes that cannot be run directly
These class files may all contain Chinese strings, and the first three kinds often exchange characters directly with users. For example, a JSP or servlet receives characters sent from the client, and these may include Chinese characters. Whatever role a Java class plays, its lifecycle is as follows:
* The programmer writes the source code in an editor on some operating system and saves it with the .java extension. For example, a Java source file can be edited with Notepad on Win2k;
* The programmer compiles the source code with javac.exe from the JDK to produce a .class file (JSP files are compiled by the container, which calls the JDK);
* The classes are run directly, or deployed to a web container to run, and the results are output.
How do the JDK and the JVM encode, decode, and run these files during these steps?
Here we use the Chinese Win2k operating system as an example to show how Java classes are encoded and decoded.
Step 1: write a Java source file (any of the five kinds above) in an editor such as Notepad on Chinese Win2k. By default the file is saved in GBK, the operating system's default encoding (exposed to Java as the file.encoding system property). In other words, before compilation the Java source file, which contains both Chinese characters and English program code, is stored in the operating system's default file.encoding. The following code displays the value of file.encoding:
public class ShowSystemDefaultEncoding {
    public static void main(String[] args) {
        // file.encoding holds the default encoding the JVM picked up from the operating system.
        String encoding = System.getProperty("file.encoding");
        System.out.println(encoding);
    }
}
Step 2: compile the Java source file with javac.exe from the JDK. If we do not specify the encoding of the source file, the JDK first reads the file.encoding parameter (which holds the operating system's default encoding; on Chinese Win2k its value is GBK). The JDK then converts the source program from the file.encoding format to Java's internal Unicode format in memory. javac compiles this Unicode form into a class file, which is held in memory in Unicode, and finally the JDK saves the compiled class to the operating system as the .class file we see. The resulting .class file therefore stores its content in Unicode. The Chinese strings from our source program are in it, but they have been converted from the file.encoding format to Unicode.
JSP source files are handled differently in this step. The web container calls the JSP compiler, which first checks whether the JSP file declares a file encoding. If it does not, the JSP compiler calls the JDK to convert the JSP file into a temporary servlet class using the JVM's default character encoding (that is, the default file.encoding of the operating system that hosts the web container), compiles it into a Unicode class, and saves it in a temporary folder. For example, on Chinese Win2k the web container converts the JSP file from GBK to Unicode and compiles it into a temporarily stored servlet class that responds to user requests.
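The decoding that happens before compilation can be imitated in a few lines. The following is only a sketch of the idea, not what javac does internally: it saves a line of source as GBK bytes, as an editor on Chinese Win2k would, and then reads the file back with a chosen charset. The class name SourceDecoding and the method saveAsGbkThenRead are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SourceDecoding {
    static final Charset GBK = Charset.forName("GBK");

    /** Saves text as GBK bytes in a temporary file, then decodes the file with readCharset. */
    static String saveAsGbkThenRead(String text, Charset readCharset) {
        try {
            Path file = Files.createTempFile("Sample", ".java");
            try {
                Files.write(file, text.getBytes(GBK));
                return new String(Files.readAllBytes(file), readCharset);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A source line containing two Chinese characters (U+4E2D U+6587).
        String source = "String s = \"\u4e2d\u6587\";";
        // Decoding with the encoding the file was saved in restores the text.
        System.out.println(source.equals(saveAsGbkThenRead(source, GBK)));
        // Decoding with a different encoding does not.
        System.out.println(source.equals(saveAsGbkThenRead(source, StandardCharsets.UTF_8)));
    }
}
```

The second comparison fails because the GBK bytes are not valid UTF-8. The same kind of mismatch occurs when a GBK source file is compiled on a system whose file.encoding is something else and no explicit encoding is given.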
Step 3: run the classes compiled in step 2:
A. classes run directly on the console
B. EJB class and support class that cannot be directly run (such as JavaBean class)
C. JSP code and Servlet class
D. Between Java programs and databases
Let's look at these four situations.
A. classes run directly on the console
Running such a class requires a JVM, so a JRE must be installed on the operating system. The process is as follows. The java command starts the JVM, which reads the class file from disk into memory, where the class is held in Unicode, and runs it. If the class needs user input, it decodes the typed characters using file.encoding by default, converts them to Unicode, and stores them in memory (the encoding of the input stream can be set explicitly). When the program produces output, the Unicode string is handed back to the JVM, and the JRE converts it to the file.encoding format (the encoding of the output stream can also be set explicitly) and passes it to the operating system's display interface.
The conversion of each step above requires correct encoding format conversion to avoid garbled characters.
B. EJB class and support class that cannot be directly run (such as JavaBean class)
EJB classes and support classes that cannot be run directly generally do not exchange input and output with users; they exchange it with other classes. After step 2 they are stored in the operating system as classes whose content is Unicode, and as long as nothing is lost when parameters are passed between them and other classes, they run correctly.
C. JSP code and Servlet class
After step 2, a JSP file has also been converted to a servlet class. Unlike a standard servlet it is not stored in the classes directory but in the web container's temporary directory. In this step we treat it as a servlet too.
When a client requests a servlet, the web container uses its JVM to run it. The JVM first loads the servlet class from disk into memory, where its code is in Unicode, and then runs it. If the servlet needs to accept characters from the client, such as values entered in a form or passed in the URL, and the program sets no encoding, the web container decodes the incoming values as ISO-8859-1 by default and converts them to Unicode in the JVM's memory. The strings the servlet outputs are in Unicode. The container then sends the Unicode strings produced by the servlet (HTML markup and user-visible output) to the client browser. If the program specifies an output encoding, the strings are sent in that encoding; otherwise they are sent as ISO-8859-1 by default.
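The effect of the container's ISO-8859-1 default can be simulated without a servlet container. The sketch below is an illustration added here, not code from the original article. It decodes GBK request bytes as ISO-8859-1, as the container is described as doing, and then applies a widely used workaround: re-encode the garbled string as ISO-8859-1 to get the raw bytes back and decode them as GBK. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ParameterRecovery {
    static final Charset GBK = Charset.forName("GBK");

    /** Simulates a container that decodes incoming request bytes as ISO-8859-1. */
    static String containerDecode(byte[] requestBytes) {
        return new String(requestBytes, StandardCharsets.ISO_8859_1);
    }

    /** Undoes the wrong decode: recover the raw bytes, then decode them as GBK. */
    static String recover(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), GBK);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587";            // two Chinese characters
        byte[] onTheWire = original.getBytes(GBK);   // what a GBK client would send
        String garbled = containerDecode(onTheWire); // four Latin-1 characters
        System.out.println(garbled.length());
        System.out.println(original.equals(recover(garbled)));
    }
}
```

The workaround is lossless only because ISO-8859-1 maps every byte value to exactly one character. Setting the request encoding before reading parameters, as recommended later in this article, avoids the wrong decode in the first place.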
D. Between Java programs and databases
For almost all database JDBC drivers, the default encoding for transferring data between the Java program and the database is ISO-8859-1. When our program stores data containing Chinese characters, JDBC first converts the program's Unicode data to ISO-8859-1 and passes it to the database, which also stores it as ISO-8859-1 by default. This is why the Chinese data we read back from the database is so often garbled.
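If the conversion is a plain encode from Unicode to ISO-8859-1, as described above, the loss is easy to reproduce without a database. This sketch is an assumption-laden illustration (the class LossyLatin1 is hypothetical and no JDBC driver is involved): Chinese characters have no ISO-8859-1 representation, so Java replaces each of them with a question mark.

```java
import java.nio.charset.StandardCharsets;

public class LossyLatin1 {
    /** Encodes text to ISO-8859-1 and decodes it again, like a store followed by a load. */
    static String roundTrip(String text) {
        byte[] stored = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(stored, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // English survives the round trip; the two Chinese characters do not.
        System.out.println(roundTrip("abc \u4e2d\u6587"));
    }
}
```

Unlike a wrong decode, this kind of damage cannot be undone afterwards, because the original bytes are gone.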
3. Principles to keep in mind when analyzing common Java Chinese problems
First, the analysis above shows that the key encoding conversions in the lifecycle of any Java program are the one performed when the source is first compiled into a class file and the one performed when output is finally delivered to the user.
Second, we must know the common encodings that Java supports at compile time:
* ISO-8859-1, 8-bit, also known by aliases such as 8859_1 and ISO_8859_1
* Cp1252, the American English encoding, also known as the ANSI standard code
* UTF-8, an encoding of Unicode
* GB2312, also known as gb2312-80, gb2312-1980, and so on
* GBK, also known as MS936, an extension of GB2312
There are also encodings for Korean, Japanese, traditional Chinese, and others. The compatibility relationships between these encodings are as follows:
Unicode and UTF-8 map one to one. GB2312 can be regarded as a subset of GBK, that is, GBK extends GB2312. GBK contains 20,902 Chinese characters in the code range 0x8140-0xFEFE, each of which maps one to one to a Unicode 2.0 character.
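These relationships can be checked from Java itself. The following sketch (the class CharsetRelations and its methods are hypothetical names, and it assumes a JDK that includes the GB2312 and GBK charsets) resolves an alias, compares the bytes that GB2312 and GBK produce for the same text, and tests a traditional character that GBK covers but GB2312 does not.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class CharsetRelations {
    static final Charset GB2312 = Charset.forName("GB2312");
    static final Charset GBK = Charset.forName("GBK");

    /** Returns the canonical name that Java resolves an alias to. */
    static String canonicalName(String alias) {
        return Charset.forName(alias).name();
    }

    /** True if the text encodes to identical bytes in GB2312 and GBK. */
    static boolean sameBytesInBoth(String text) {
        return Arrays.equals(text.getBytes(GB2312), text.getBytes(GBK));
    }

    /** True if the charset has a representation for the character. */
    static boolean canEncode(Charset charset, char c) {
        return charset.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("8859_1"));         // alias of ISO-8859-1
        System.out.println(sameBytesInBoth("\u4e2d\u6587")); // simplified characters
        System.out.println(canEncode(GBK, '\u9f8d'));        // traditional "dragon"
        System.out.println(canEncode(GB2312, '\u9f8d'));
    }
}
```

Text that stays within GB2312 is byte-for-byte identical in GBK, which is why the two names are often used interchangeably, while GBK is the safer choice when traditional or rare characters may appear.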
Third, for a .java source file stored on the operating system, we can specify the encoding of its content at compile time with the -encoding option. Note: if the source program contains Chinese characters and -encoding names some other, non-Chinese encoding, the result will obviously be wrong. Specify GBK or GB2312 with -encoding, and a Java source program containing Chinese characters will have its Chinese converted to Unicode correctly and stored in the class file, whatever system we compile it on.
Finally, we must be clear that almost all web containers use ISO-8859-1 as their internal default character encoding, while almost all browsers pass parameters in UTF-8 by default. So even though our Java source file is given the correct encoding at the entry point, data is still handled as ISO-8859-1 when running inside the container.
4. Classification of Chinese problems and recommended solutions
With the principles of Java file processing above in mind, we can propose a set of recommended methods for solving Chinese character problems.
Our goal is this: a Java source program that contains or processes Chinese strings, compiled on a Chinese system, should run correctly when moved to any other operating system; compiled on another operating system, it should also run correctly, pass Chinese and English parameters correctly, and exchange Chinese and English strings with the database correctly.
Our approach is to fix the correct encoding at the entry and exit points of the program's encoding conversions, that is, at compilation and wherever user input and output are converted.
The specific solutions are as follows:
1. For classes that run directly on the console
If the program needs to receive input or produce output that may contain Chinese characters, we recommend using character streams for input and output. Specifically, use the following character-oriented node stream types:
File: FileReader, FileWriter
(the byte node stream types are FileInputStream and FileOutputStream)
Memory (array): CharArrayReader, CharArrayWriter
(the byte node stream types are ByteArrayInputStream and ByteArrayOutputStream)
Memory (string): StringReader, StringWriter
Pipe: PipedReader, PipedWriter
(the byte node stream types are PipedInputStream and PipedOutputStream)
At the same time, use the following character-oriented processing streams for input and output:
BufferedWriter, BufferedReader
(the byte processing streams are BufferedInputStream and BufferedOutputStream)
InputStreamReader, OutputStreamWriter
(the byte processing streams are DataInputStream and DataOutputStream)
InputStreamReader and OutputStreamWriter convert between byte streams and character streams according to the specified character set, for example:
InputStreamReader in = new InputStreamReader(System.in, "gb2312");
OutputStreamWriter out = new OutputStreamWriter(System.out, "gb2312");
For example, the following Java program meets these requirements:
// Read.java
import java.io.*;

public class Read {
    public static void main(String[] args) throws IOException {
        String str = "Chinese test, this is an internal hard-coded string" + "test English character";
        String strin = "";
        // Decode console input as Chinese (GB2312).
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in, "gb2312"));
        // Encode console output as Chinese (GB2312).
        BufferedWriter stdout = new BufferedWriter(new OutputStreamWriter(System.out, "gb2312"));
        stdout.write("Enter:");
        stdout.flush();
        strin = stdin.readLine();
        stdout.write("this is from the user input string:" + strin);
        stdout.write(str);
        stdout.flush();
    }
}
We compile the program as follows:
javac -encoding gb2312 Read.java
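The same reader and writer pattern can be exercised without a console by substituting in-memory byte streams for System.in and System.out. This is a minimal sketch under that assumption; the class StreamRoundTrip and its methods are hypothetical names, not part of the original article.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class StreamRoundTrip {
    /** Writes text through an OutputStreamWriter and returns the encoded bytes. */
    static byte[] writeAs(String text, String charsetName) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (Writer out = new BufferedWriter(new OutputStreamWriter(bytes, charsetName))) {
                out.write(text);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads one line from the bytes through an InputStreamReader. */
    static String readLineAs(byte[] data, String charsetName) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charsetName))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters
        byte[] encoded = writeAs(text, "gb2312");
        System.out.println(encoded.length);
        System.out.println(text.equals(readLineAs(encoded, "gb2312")));
    }
}
```

Because both ends name the same character set explicitly, the result does not depend on the file.encoding of the machine that runs the code.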
2. For EJB classes and support classes that cannot be run directly (such as JavaBean classes)
Because these classes are called by other classes and do not interact with users directly, we recommend that they process Chinese strings internally with character streams (as in the previous section) and that they be compiled with the -encoding gb2312 option to indicate that the source file is encoded in Chinese.
3. For servlets
For servlets, we recommend the following:
Compile the servlet source with -encoding set to GBK or GB2312, and set the output encoding with response.setContentType("text/html; charset=GBK"); (or gb2312). Likewise, call request.setCharacterEncoding("gb2312"); before reading user input. Then, whichever operating system the servlet is ported to, it will work as long as the client's browser can display Chinese. The following is a correct example:
// HelloWorld.java
package hello;

import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class HelloWorld extends HttpServlet
{
    public void init() throws ServletException {}

    public void doGet(HttpServletRequest request, HttpServletResponse response) throws IOException, ServletException
    {
        request.setCharacterEncoding("gb2312"); // set the input encoding
        response.setContentType("text/html; charset=gb2312"); // set the output encoding
        PrintWriter out = response.getWriter(); // PrintWriter output is recommended
        out.println("<hr>");
        out.println("Hello world! This is created by Servlet! Test Chinese!");
        out.println("<hr>");
    }

    public void doPost(HttpServletRequest request, HttpServletResponse response) throws IOException, ServletException
    {
        request.setCharacterEncoding("gb2312"); // set the input encoding
        response.setContentType("text/html; charset=gb2312"); // set the output encoding
        String name = request.getParameter("name");
        String id = request.getParameter("ID");
        if (name == null) name = "";
        if (id == null) id = "";
        PrintWriter out = response.getWriter(); // PrintWriter output is recommended
        out.println("<hr>");
        out.println("your input Chinese string is:" + name);
        out.println("<hr>the id you entered is:" + id);
        out.println("<hr>");
    }

    public void destroy() {}
}
Use javac -encoding gb2312 HelloWorld.java to compile this program.
The page for testing this servlet is as follows:
<%@ page contentType="text/html; charset=gb2312" %>
<% request.setCharacterEncoding("gb2312"); %>
<html><head><script language="JavaScript">
function submit() {
    // Pass the Chinese string value to the servlet through the URL
    document.base.action = "./helloworld?name=Chinese";
    document.base.method = "POST";
    document.base.submit();
}
</script>
</head>
<body bgcolor="#ffffff" text="#000000" topmargin="5">
<form name="base" method="POST" target="_self">
<input name="ID" type="text" value="" size="30">
<a href="javascript:submit()">send to servlet</a>
</form></body></html>
4. Between Java programs and databases
To avoid garbled characters when data moves between a Java program and a database, we recommend the following:
1. Have the Java program handle encodings in the way described above.
2. Change the database's default encoding to GBK or GB2312.
For example, in MySQL, add the following to the configuration file my.ini.
In the [mysqld] section, add:
default-character-set=gbk
Also add:
[client]
default-character-set=gbk
In SQL Server 2000, set the default language of the database to Simplified Chinese.
5. For JSP code
A JSP is compiled dynamically by the web container at run time. If the encoding of the JSP source file is not specified, the JSP compiler uses the server operating system's file.encoding value to compile it, which makes JSP the most likely to break when ported. For example, a JSP file that runs well on Chinese Win2k may fail on English Linux even though the client is the same, because the container obtains a different operating system encoding when it compiles the JSP (file.encoding on Chinese Win2k differs from file.encoding on English Linux, and the latter does not support Chinese, so the compiled JSP class is wrong). Most of the problems discussed online are of this kind: a JSP file no longer displays correctly after being moved to another platform. Once we understand how Java converts encodings, these problems are much easier to solve. The recommended solution is as follows:
1. Make sure the JSP is output to the client in a Chinese encoding. In every case, first add the following line to the JSP source:
<%@ page contentType="text/html; charset=gb2312" %>
2. So that the JSP reads incoming parameters correctly, add the following line to the head of the JSP source file:
<% request.setCharacterEncoding("gb2312"); %>
3. So that the JSP compiler decodes a JSP file containing Chinese characters correctly, specify the encoding of the source file in the file itself by adding the following line to the head of the JSP source file:
<%@ page pageEncoding="gb2312" %> or <%@ page pageEncoding="GBK" %>
This directive was newly added in the JSP 2.0 specification.
We recommend this method for solving Chinese problems in JSP files. The following test program shows a JSP file written the correct way:
// TestChinese.jsp
<%@ page pageEncoding="gb2312" %>
<%@ page contentType="text/html; charset=gb2312" %>
<% request.setCharacterEncoding("gb2312"); %>
<%
String action = request.getParameter("action");
String name = "";
String str = "";
if (action != null && action.equals("sent"))
{
    name = request.getParameter("name");
    str = request.getParameter("str");
}
%>
<html>
<head>
<title></title>
<script language="JavaScript">
function submit()
{
    document.base.action = "?action=sent&str=input Chinese";
    document.base.method = "POST";
    document.base.submit();
}
</script>
</head>
<body bgcolor="#ffffff" text="#000000" topmargin="5">
<form name="base" method="POST" target="_self">
<input type="text" name="name" value="" size="30">
<a href="javascript:submit()">submit</a>
</form>
<%
if (action != null && action.equals("sent"))
{
    out.println("<br>the character you entered is: " + name);
    out.println("<br>the character you pass through the URL is: " + str);
}
%>
</body>
</html>

Most local test environments use Tomcat, so its Chinese problems are covered here as well.
Tomcat and Chinese
With Tomcat 5, I found that the method I had used on Tomcat 4 no longer handled requests submitted directly through the URL. After searching online I found a solution that works well: nothing needs to be changed throughout the application, and both GET and POST work normally. I wrote it up for anyone who runs into the same problem.
Problem description:
1. For data submitted from a form, request.getParameter("XXX") returns a garbled string or ??.
2. For a GET request made directly through a URL such as http://localhost/a.jsp?name=China, request.getParameter("name") on the server returns garbled characters. Setting a filter as on Tomcat 4, or calling request.setCharacterEncoding("GBK"), has no effect.
Cause:
1. Tomcat's J2EE implementation processes parameters submitted by a form with the POST method using ISO-8859-1 by default.
2. Tomcat processes query-string parameters submitted with GET differently from POST parameters. (Unlike on Tomcat 4, setCharacterEncoding("GBK") has no effect on them.)
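To see why the charset used for the query string matters, it helps to look at what a Chinese value becomes in a URL. The sketch below uses the standard java.net.URLEncoder and java.net.URLDecoder classes to imitate the two sides; it does not use Tomcat, and the class QueryStringDecoding and its methods are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class QueryStringDecoding {
    /** Percent-encodes a value the way a client using the given charset would. */
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Decodes a percent-encoded value the way a server assuming the given charset would. */
    static String decode(String value, String charsetName) {
        try {
            return URLDecoder.decode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String china = "\u4e2d\u56fd"; // the two characters for "China"
        String sent = encode(china, "GBK");
        System.out.println(sent);
        // A server that assumes GBK recovers the value.
        System.out.println(china.equals(decode(sent, "GBK")));
        // A server that assumes ISO-8859-1 gets four unrelated Latin-1 characters.
        System.out.println(decode(sent, "ISO-8859-1").length());
    }
}
```

The query string carries only bytes, so the server must be told which charset to decode them with.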
Solution:
First, add the following to all JSP files:
1. POST solution: implement a filter that sets the request character set to GBK. (There is a complete example in Tomcat's webapps/servlets-examples directory; see its web.xml configuration and SetCharacterEncodingFilter.)

1) Copy the file %TOMCAT installation directory%/webapps/servlets-examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.class to the corresponding filters directory of your own webapp (WEB-INF/classes/filters), creating the directory if it does not exist.
2) Add the following lines to your web.xml:
<filter>
<filter-name>Set Character Encoding</filter-name>
<filter-class>filters.SetCharacterEncodingFilter</filter-class>
<init-param>
<param-name>encoding</param-name>
<param-value>GBK</param-value>
</init-param>
</filter>
<filter-mapping>
<filter-name>Set Character Encoding</filter-name>
<url-pattern>/*</url-pattern>
</filter-mapping>
3) Done.
2. GET solution
1) Open Tomcat's server.xml file, locate the Connector block, and add the following line:
URIEncoding="GBK"
The complete block should look like this:
<Connector
port="80" maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
enableLookups="false" redirectPort="8443" acceptCount="100"
debug="0" connectionTimeout="20000"
disableUploadTimeout="true"
URIEncoding="GBK"
/>
2) Restart Tomcat. Everything should now work.
Run the following JSP page to test whether the setup works.
<%@ page contentType="text/html; charset=gb2312" %>
<%@ page import="java.util.*" %>
<%
String q = request.getParameter("q");
q = q == null ? "No value" : q;
%>
<html>
<head><title>display the news list</title>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<meta http-equiv="Pragma" content="no-cache">
</head>
<body>
You submitted:
<%= q %>
<br>
<form action="tcnchar.jsp" method="post">
Chinese Input: <input type="text" name="q"><input type="submit" value="OK">
<br>
<a href="tcnchar.jsp?q=China">submit via get</a>
</form>
</body>
</html>
If you type into the text box or click the hyperlink and the page displays: You submitted "China", the setup is working.

The material above should meet the needs of most programmers. I would like to thank the programmers who have published excellent articles online for Java developers. Salute to them!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
