Java/JSP Chinese character encoding problems


Last Update:2017-01-18 Source: Internet

Author: User


Preliminary knowledge:
1. Bytes and Unicode
Java is Unicode internally, even in class files, but many media, including files and streams,
store data as byte streams. So Java has to convert between bytes and chars. A char is Unicode; a byte is a byte.
The byte/char conversion functions live in the sun.io package. The ByteToCharConverter class acts as the dispatcher:
it tells you which converter you are using. Its two most common static functions are
public static ByteToCharConverter getDefault();
public static ByteToCharConverter getConverter(String encoding);
If you do not specify a converter, the system automatically uses the platform's current encoding:
GBK on a GB (Chinese) platform, 8859_1 on an English platform.
　　
Let's take a simple example:
The GB code of "你" is 0xC4E3; its Unicode value is 0x4F60.
You use:
  String encoding = "gb2312";
  byte[] b = {(byte) '\u00c4', (byte) '\u00e3'};
  ByteToCharConverter converter = ByteToCharConverter.getConverter(encoding);
  char[] c = converter.convertAll(b);
  for (int i = 0; i < c.length; i++)
  {
    System.out.println(Integer.toHexString(c[i]));
  }
  This prints 0x4F60,
  but if you use the 8859_1 encoding, it prints
  0x00C4, 0x00E3
    (Case 1)
Conversely:
  String encoding = "gb2312";
  char[] c = {'\u4f60'};
  CharToByteConverter converter = CharToByteConverter.getConverter(encoding);
  byte[] b = converter.convertAll(c);
  for (int i = 0; i < b.length; i++)
  {
    // mask with 0xff so negative bytes are not sign-extended
    System.out.println(Integer.toHexString(b[i] & 0xff));
  }
  This prints 0xC4, 0xE3
    (Case 2)
  With 8859_1 the result is 0x3F, "?", which means the character cannot be converted.
Many Chinese-character problems stem from these two simple classes. Yet many classes
do not accept an encoding as input, which is very inconvenient. Many programs seldom specify an encoding
and simply use the default one, which makes porting between platforms difficult.
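Case 1 and Case 2 can be reproduced with the public java.nio.charset API. The sun.io converter classes are internal and, as far as I know, are not available in current JDKs, so this is a minimal sketch of the same behaviour rather than the original code. The class and method names are made up for illustration, and it assumes the JDK ships the GB2312 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConverterDemo {
    // Decode bytes with the given charset and list each resulting char in hex.
    static String decodeToHex(byte[] bytes, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (char ch : new String(bytes, cs).toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(Integer.toHexString(ch));
        }
        return sb.toString();
    }

    // Encode a string with the given charset and list each resulting byte in hex.
    static String encodeToHex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(Integer.toHexString(b & 0xff)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset gb = Charset.forName("GB2312"); // assumption: charset is installed
        byte[] b = {(byte) 0xc4, (byte) 0xe3};

        // Case 1: bytes to chars
        System.out.println(decodeToHex(b, gb));                          // 4f60
        System.out.println(decodeToHex(b, StandardCharsets.ISO_8859_1)); // c4 e3

        // Case 2: chars to bytes
        System.out.println(encodeToHex("\u4f60", gb));                          // c4 e3
        System.out.println(encodeToHex("\u4f60", StandardCharsets.ISO_8859_1)); // 3f
    }
}
```

Passing the Charset explicitly, instead of relying on the platform default, is what keeps the result the same on a Chinese and an English system.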
2. UTF-8
  UTF-8 maps one-to-one onto Unicode, and its implementation is simple:

   7-bit Unicode: 0 _ _ _ _ _ _ _
  11-bit Unicode: 1 1 0 _ _ _ _ _  1 0 _ _ _ _ _ _
  16-bit Unicode: 1 1 1 0 _ _ _ _  1 0 _ _ _ _ _ _  1 0 _ _ _ _ _ _
  21-bit Unicode: 1 1 1 1 0 _ _ _  1 0 _ _ _ _ _ _  1 0 _ _ _ _ _ _  1 0 _ _ _ _ _ _
  In most cases only Unicode values of 16 bits or fewer are used.
  The GB code of "你" is 0xC4E3; its Unicode value is 0x4F60.
  We continue with the example above.
    Example 1: 0xC4E3 in binary:
    1 1 0 0 0 1 0 0  1 1 1 0 0 0 1 1
    Since there are only two bytes, we try to read them as the two-byte form, but that does not work:
    the second byte does not start with 1 0 (its 7th bit is not 0), so "?" is returned.

    Example 2: 0x4F60 in binary:
    0 1 0 0 1 1 1 1  0 1 1 0 0 0 0 0
    Filling these bits into the three-byte UTF-8 form gives:
    11100100 10111101 10100000
    E4 BD A0
    so the result is 0xE4, 0xBD, 0xA0
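The bit layouts above can be checked by packing a char into UTF-8 by hand and comparing with the library's encoder. This is only a sketch for 16-bit values: the class name is invented for the example, and the 21-bit form (surrogate pairs) is deliberately left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Demo {
    // Hand-pack one 16-bit char into UTF-8 using the 7-, 11- and 16-bit layouts.
    // Surrogates are not handled; this is not a complete encoder.
    static byte[] encode(char c) {
        if (c < 0x80) {
            return new byte[] {(byte) c};                    // 0xxxxxxx
        } else if (c < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),                    // 110xxxxx
                (byte) (0x80 | (c & 0x3F))};                 // 10xxxxxx
        } else {
            return new byte[] {
                (byte) (0xE0 | (c >> 12)),                   // 1110xxxx
                (byte) (0x80 | ((c >> 6) & 0x3F)),           // 10xxxxxx
                (byte) (0x80 | (c & 0x3F))};                 // 10xxxxxx
        }
    }

    public static void main(String[] args) {
        byte[] manual = encode('\u4f60');
        byte[] library = "\u4f60".getBytes(StandardCharsets.UTF_8);
        for (byte b : manual) {
            System.out.print(Integer.toHexString(b & 0xff) + " "); // e4 bd a0
        }
        System.out.println();
        System.out.println(Arrays.equals(manual, library));        // true
    }
}
```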
3. String and byte[]
  The core of a String is a char[]; to convert bytes into a String, however, they must be decoded with an encoding.
  String.length() is really the length of the char array, so with a different encoding the bytes
  may be split in the wrong places, producing broken characters and garbled text.
  Example:
    byte[] b = {(byte) '\u00c4', (byte) '\u00e3'};
    String str = new String(b, encoding);
    With encoding=8859_1 there are two characters; with encoding=gb2312 there is only one.
  This problem often shows up when splitting text into pages.
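A small sketch of the length difference, and of why cutting a byte array at an arbitrary position (as naive paging might do) breaks characters. The class and method names are illustrative only, and the GB2312 charset is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StringLengthDemo {
    // Number of chars produced when the bytes are decoded with the given charset.
    static int charCount(byte[] bytes, Charset cs) {
        return new String(bytes, cs).length();
    }

    // True if decoding only the first n bytes still yields the expected text.
    static boolean survivesCut(byte[] bytes, int n, Charset cs, String expected) {
        return new String(Arrays.copyOf(bytes, n), cs).equals(expected);
    }

    public static void main(String[] args) {
        Charset gb = Charset.forName("GB2312"); // assumption: charset is installed
        byte[] b = {(byte) 0xc4, (byte) 0xe3};

        System.out.println(charCount(b, gb));                          // 1
        System.out.println(charCount(b, StandardCharsets.ISO_8859_1)); // 2

        // Cutting after one byte splits the character in half.
        System.out.println(survivesCut(b, 1, gb, "\u4f60"));           // false
        System.out.println(survivesCut(b, 2, gb, "\u4f60"));           // true
    }
}
```

Paging on char positions of an already decoded String, rather than on byte offsets, avoids the split.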
4. Reader, Writer / InputStream, OutputStream
  Reader and Writer work on chars at their core; InputStream and OutputStream work on bytes.
  The main purpose of Reader and Writer, though, is to read chars from an InputStream and write chars to an OutputStream.
  A Reader example:
  The file test.txt contains only the character "你", i.e. 0xC4, 0xE3.
  String encoding = ...;  // gb2312 or 8859_1, see below
  InputStreamReader reader = new InputStreamReader(
    new FileInputStream("test.txt"), encoding);
  char[] c = new char[10];
  int length = reader.read(c);
  for (int i = 0; i < length; i++)  // only the chars actually read
    System.out.println(c[i]);
  If encoding is gb2312 there is only one character; if encoding=8859_1 there are two.
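The Reader example can be tried without a file by feeding the same two bytes through a ByteArrayInputStream. This is a sketch under that substitution; the class and method names are hypothetical, and GB2312 is assumed to be available.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReaderDemo {
    // Decode the bytes through an InputStreamReader and return how many chars came out.
    static int readChars(byte[] fileBytes, Charset cs) {
        char[] c = new char[10];
        int total = 0;
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(fileBytes), cs)) {
            int n;
            // read() may return fewer chars than requested, so loop until it stops
            while ((n = reader.read(c, total, c.length - total)) > 0) {
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (int i = 0; i < total; i++) {
            System.out.println(Integer.toHexString(c[i]));
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] fileBytes = {(byte) 0xc4, (byte) 0xe3}; // stands in for test.txt
        System.out.println(readChars(fileBytes, Charset.forName("GB2312")));   // 1
        System.out.println(readChars(fileBytes, StandardCharsets.ISO_8859_1)); // 2
    }
}
```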
2. What we need to know about the Java compiler:
  javac -encoding
We often leave out the -encoding parameter. In fact, it is very important for cross-platform work.
If no encoding is specified, the system default is used: gb2312 on a GB platform, ISO8859_1 on an English platform.
  The Java compiler actually invokes the sun.tools.javac.Main class to compile the file. This class
has an encoding variable in its compile function, and the -encoding parameter is passed straight to that variable.
The compiler reads the Java file according to this variable and then compiles it into a class file in UTF-8 form.
An example:
  public void test() throws IOException
  {
    String str = "你";
    FileWriter write = new FileWriter("test.txt");
    write.write(str);
    write.close();
  }
    (Case 3)
  If you compile with gb2312, you will find the bytes E4 BD A0 in the class file.

  If you compile with 8859_1, the chars are
  00C4 00E3, in binary:
  00000000 11000100  00000000 11100011
  Because each character needs more than 7 bits, the 11-bit form is used:
  11000011 10000100  11000011 10100011
  C3 84 C3 A3
  You will find C3 84 C3 A3 in the class file.
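The two outcomes of Case 3 can be simulated without running javac: decode the source bytes with the assumed -encoding value, then write the resulting chars as UTF-8. This is a sketch only. Class files use a modified UTF-8, which to my knowledge matches standard UTF-8 for these particular characters; the class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class JavacEncodingDemo {
    // Bytes stored for a string literal whose source bytes were read with
    // sourceEncoding: decode first, then re-encode the chars as UTF-8.
    static String storedBytes(byte[] sourceBytes, Charset sourceEncoding) {
        String literal = new String(sourceBytes, sourceEncoding);
        StringBuilder sb = new StringBuilder();
        for (byte b : literal.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(Integer.toHexString(b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] sourceBytes = {(byte) 0xc4, (byte) 0xe3}; // the literal as saved in a GB-encoded .java file

        // Read as gb2312: one char 0x4F60
        System.out.println(storedBytes(sourceBytes, Charset.forName("GB2312")));   // e4 bd a0
        // Read as 8859_1: two chars 0x00C4 0x00E3
        System.out.println(storedBytes(sourceBytes, StandardCharsets.ISO_8859_1)); // c3 84 c3 a3
    }
}
```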
　　　　
But we tend to ignore this parameter, so cross-platform problems keep appearing:
  Case 3 compiled on a Chinese platform produces ZhClass
  Case 3 compiled on an English platform produces EnClass
  1. ZhClass runs OK on the Chinese platform, but not on the English platform
  2. EnClass runs OK on the English platform, but not on the Chinese platform
Reason:
  1. After compiling on the Chinese platform, str at run time is the char[] 0x4F60.
  Run on the Chinese platform, FileWriter's default encoding is gb2312, so
  CharToByteConverter automatically calls the gb2312 converter to turn str
  into bytes for the FileOutputStream, and 0xC4, 0xE3 goes into the file.
  On the English platform, however, the default CharToByteConverter is 8859_1;
  FileWriter automatically uses 8859_1 to convert str, but 8859_1 cannot represent it, so it
  outputs "?".
  2. After compiling on the English platform, str at run time is the char[] 0x00C4 0x00E3.
  Run on the Chinese platform, these are not recognized as Chinese, so "??" appears.
  On the English platform, 0x00C4 -> 0xC4 and 0x00E3 -> 0xE3, so 0xC4, 0xE3 is written to the
  file.
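The FileWriter behaviour described above can be imitated with an OutputStreamWriter whose charset is passed in explicitly, standing in for the platform default. A minimal sketch, writing to memory instead of test.txt; the names are illustrative and GB2312 is assumed to be installed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WriterDemo {
    // Write str through a Writer using the given charset; return the bytes in hex.
    static String written(String str, Charset cs) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, cs)) {
            w.write(str);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : sink.toByteArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(Integer.toHexString(b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset gb = Charset.forName("GB2312"); // stands in for the Chinese platform default
        Charset en = StandardCharsets.ISO_8859_1; // stands in for the English platform default

        // ZhClass: str is the single char 0x4F60
        System.out.println(written("\u4f60", gb)); // c4 e3
        System.out.println(written("\u4f60", en)); // 3f, i.e. "?"

        // EnClass: str is the two chars 0x00C4 0x00E3
        System.out.println(written("\u00c4\u00e3", en)); // c4 e3
    }
}
```

Naming the charset in the Writer, as here, removes the dependence on the platform default that causes the ZhClass/EnClass split.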
1. How the JSP body is interpreted:
  Tomcat first checks whether your page contains a "<%@ page" directive. If it does, Tomcat inserts
  response.setContentType(..) at that spot and reads the file according to that encoding; without the directive it
  reads the file as 8859_1. It then writes the .java file in UTF-8, reads that file with sun.tools.javac.Main
  (using UTF-8, of course), and compiles it into a class file.
  setContentType changes the properties of out; the default encoding of the out variable is 8859_1.

2. How parameters are interpreted
Unfortunately, parameters are only ever interpreted as ISO8859_1; this can be seen in the servlet implementation code.

3. How include is interpreted
The include format accepts an encoding, but unfortunately the person who wrote org.apache.jasper.compiler.Parser
forgot to add one attribute, encoding, to the array JspUtil.ValidAttribute[], so this form is not
supported. You can recompile the source code yourself and add support for encoding.

Summary:

If you are on NT, the easiest approach is to trick Java and use no encoding variables at all:
Hello <%=request.getParameter("value")%>

http://localhost/test/test.jsp?value=你

Result: Hello 你

But this method is very limited. For tasks such as splitting an uploaded article into pages it simply fails, so the best
solution is this one:
<%@ page contentType="text/html;charset=gb2312"%>
Hello <%=new String(request.getParameter("value").getBytes("8859_1"), "gb2312")%>
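The getBytes("8859_1") trick in the last line works because 8859_1 maps every byte to exactly one char and back, so the original bytes can be recovered and decoded again. A sketch of the same idea as a helper; the class and method names are made up, the input string simulates what the container would hand back, and GB2312 is assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ParamFix {
    // Undo an 8859_1 decode: recover the raw bytes, then decode with the real charset.
    static String redecode(String misdecoded, Charset actual) {
        if (misdecoded == null) {
            return null; // getParameter returns null for a missing parameter
        }
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    public static void main(String[] args) {
        // What request.getParameter("value") would return if the container
        // decoded the GB bytes C4 E3 as 8859_1.
        String fromContainer = "\u00c4\u00e3";
        String fixed = redecode(fromContainer, Charset.forName("GB2312"));
        System.out.println(Integer.toHexString(fixed.charAt(0))); // 4f60
    }
}
```

The null check matters in a real page: calling getBytes on a missing parameter would throw a NullPointerException.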

A must-read article, but I can't endorse the solution

--------------------------------------------------------------------------------

1. I don't recommend passing web parameters with the GET method; besides, users can configure whether URLs are sent as UTF-8.
2. I recommend that JSPs not use the charset=gb2312 page directive above. Adding that line does make Chinese display correctly, but I find it inconvenient; at the very least, you have to write this extra code. With a configuration like the following, I think Chinese displays correctly:
A. Compile all JavaBeans with iso8859-1
B. Do not write the charset=gb2312 statement above in the JSP file (writing it is what causes errors)

With Tomcat, paying attention to these two points is enough. For other JSP servers, in case they behave differently, also add:
C. Set the operating system language on the server to English (Linux, even a Chinese distribution such as Bluepoint, is generally English)
That's all.

If anything here is wrong, please point it out.

Re: A must-read article, but I can't endorse the solution

--------------------------------------------------------------------------------

Tomcat decodes parameters as 8859_1, for both GET and POST. You can see this in the source code of Tomcat's servlet implementation:
a) For the POST method
The parsePostData method of javax.servlet.http.HttpUtils (for POST form data):
  String postedBody = new String(postedBytes, 0, len, "8859_1");
There is no problem at this point, because Chinese characters arrive encoded as %XX. But the parseName function does not combine the bytes that belong to one Chinese character; it simply appends them one by one, so it effectively follows the 8859_1 rules:
  sb.append((char) Integer.parseInt(s.substring(i+1, i+3), 16));
  i += 2;

b) For the GET method
org.apache.tomcat.service.http.HttpRequestAdapter
  line = new String(buf, 0, count,
    Constants.CharacterEncoding.Default);
  Constants.CharacterEncoding.Default = 8859_1
This code is not easy to follow, so do not be misled by appearances. HttpRequestAdapter is derived from RequestImpl. However, the server on port 8080 does not use RequestImpl directly; it uses HttpRequestAdapter to obtain the query string.
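The effect of the parseName loop quoted above can be compared with a charset-aware decode. The sketch below is not Tomcat's code: naiveDecode only mimics the quoted two lines, the class name is invented, and the comparison uses java.net.URLDecoder with an explicit encoding (GB2312 assumed to be available).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class PercentDecodeDemo {
    // Mimics the quoted approach: every %XX escape becomes one char,
    // which is equivalent to decoding the bytes as 8859_1.
    static String naiveDecode(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch == '%') {
                sb.append((char) Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else if (ch == '+') {
                sb.append(' ');
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    // Collects the escaped bytes first, then decodes them with the named encoding.
    static String charsetAwareDecode(String s, String encoding) {
        try {
            return URLDecoder.decode(s, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String value = "%C4%E3"; // the GB bytes of one Chinese character, percent-encoded

        System.out.println(naiveDecode(value).length());                  // 2
        System.out.println(charsetAwareDecode(value, "GB2312").length()); // 1
    }
}
```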

As for whether or not to add the encoding, I reserve my opinion: if you want to solve the problem of paging uploaded files, you have to do the encoding. Encoding also guarantees that text passes through beans intact.

Looks like I need to explain myself here

--------------------------------------------------------------------------------

Tomcat is just a standard implementation of JSP 1.1 and Servlet 2.2. We should not demand that free software be meticulous and performant in every respect. It is designed mainly for English-speaking users, which is why it does no special conversion for Chinese characters passed in URLs. Most of our browsers are IE, whose advanced setting "Always send URLs as UTF-8" is selected by default. You could call this a Tomcat bug. In addition, Tomcat seems to compile JSPs as iso8859 regardless of the operating system's language, which I think is also a bit of a flaw. But in any case, implementations of new standards and popular software always consider English first when it comes to language support.

My plan: what is better about it?
1. As I said, software from English-speaking countries always considers English first. The Java Virtual Machine specification requires a virtual machine to implement only three encodings: ISO8859, Unicode, and UTF-8; nothing else is required. The virtual machine in the JDK we use works this way, not to mention embedded ones. In other words, other encodings are probably not supported directly by the Java virtual machine, and Chinese is not on the list; conversion needs support from an external package, which in the Sun JDK should be i18n.jar. ISO8859 is the fastest because it needs no other calls or conversions, and no extra I/O to read packages.
2. At the very least there is less code to write and no extra operations. Who doesn't like a concise style?
3. The JSP pages are well internationalized. I wrote a JSP + JavaBeans chat room (no servlets; JSP is really good): with the same program, Americans using their browsers get an English interface and Chinese users get a Chinese interface. Adding charset=gb2312 would at the very least be troublesome.
4. It limits you to gb2312. What if the user wants GBK? Better not to add it: whatever the character set, as long as my browser is set to it, I can display it.

Summary: In terms of speed, development efficiency, and scalability, my plan is better than yours; besides, I can't find anything your plan does better than mine.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
