A second discussion of Chinese-character problems in Java

Source: Internet
Author: User
Let me explain how Tomcat implements JSP, and the Chinese-character problems will become clear.
Preliminary knowledge:
1. Bytes and Unicode
Java is Unicode internally, and that includes class files, but most media, including files and streams,
store data as byte streams. Java therefore has to convert between bytes and chars. A char is Unicode; a byte is a raw byte.
The byte/char conversion functions live in the sun.io package. The ByteToCharConverter class is the dispatcher:
it tells you which converter is in use. Its two most commonly used static functions are
    public static ByteToCharConverter getDefault();
    public static ByteToCharConverter getConverter(String encoding);
If you do not specify a converter, the system uses the platform's current encoding: GBK on a Chinese (GB) platform,
8859_1 on an English platform.

Let's take a simple example:
"你" ("you"): the GB code is 0xC4E3, the Unicode value is 0x4F60.
You write:
    String encoding = "gb2312";
    byte[] b = {(byte) '\u00c4', (byte) '\u00e3'};
    ByteToCharConverter converter = ByteToCharConverter.getConverter(encoding);
    char[] c = converter.convertAll(b);
    for (int i = 0; i < c.length; i++)
    {
        System.out.println(Integer.toHexString(c[i]));
    }
This prints 4f60 (0x4F60),
but if you use the 8859_1 encoding, it prints
c4 and e3 (0x00C4, 0x00E3).
    (Example 1)
The reverse direction:
    String encoding = "gb2312";
    char[] c = {'\u4f60'};
    CharToByteConverter converter = CharToByteConverter.getConverter(encoding);
    byte[] b = converter.convertAll(c);
    for (int i = 0; i < b.length; i++)
    {
        // mask with 0xff so that negative bytes print as two hex digits
        System.out.println(Integer.toHexString(b[i] & 0xff));
    }
This prints c4 and e3 (0xC4, 0xE3).
    (Example 2)
With 8859_1 the output is 3f (0x3F, "?"), which means the character cannot be converted.
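
The sun.io converters are JDK-internal classes and, as far as I know, are no longer present in current JDKs. A minimal sketch of the same two conversions with the public java.nio.charset API follows. The class name CharsetRoundTrip and its helper methods are made up for illustration, and it assumes the JDK ships the GB2312 charset (standard full JDKs do).

```java
import java.nio.charset.Charset;

class CharsetRoundTrip {
    // Decode raw bytes into a String using the named charset (Example 1).
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Encode a String into bytes (Example 2); characters the charset
    // cannot represent are replaced by '?' (0x3f).
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    // Render bytes as lowercase hex pairs, e.g. "c4 e3".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] gb = {(byte) 0xc4, (byte) 0xe3};
        // bytes -> chars
        System.out.println(Integer.toHexString(decode(gb, "GB2312").charAt(0))); // 4f60
        System.out.println(decode(gb, "ISO-8859-1").length());                   // 2
        // chars -> bytes
        System.out.println(hex(encode("\u4f60", "GB2312")));                     // c4 e3
        System.out.println(hex(encode("\u4f60", "ISO-8859-1")));                 // 3f
    }
}
```

Passing the charset explicitly keeps the result independent of the platform default, which is the point of the two examples.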
Many of the Chinese-character problems derive from these two very simple classes. Yet many classes
do not accept an encoding as input directly, which is a real inconvenience. Many programs rarely specify an encoding
and simply use the default encoding, which makes porting them difficult.
2. UTF-8
UTF-8 maps one-to-one onto Unicode, and its implementation is simple:

     7-bit Unicode: 0 _ _ _ _ _ _ _
    11-bit Unicode: 1 1 0 _ _ _ _ _   1 0 _ _ _ _ _ _
    16-bit Unicode: 1 1 1 0 _ _ _ _   1 0 _ _ _ _ _ _   1 0 _ _ _ _ _ _
    21-bit Unicode: 1 1 1 1 0 _ _ _   1 0 _ _ _ _ _ _   1 0 _ _ _ _ _ _   1 0 _ _ _ _ _ _

In most cases only Unicode values of 16 bits or fewer are used.
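
The bit patterns above can be turned into code almost literally. The following is a minimal sketch, not a production encoder: the class name Utf8ByHand is made up, and it does not reject surrogates or values above 0x10FFFF.

```java
class Utf8ByHand {
    // Encode one Unicode value by filling the blank bits of the matching pattern.
    static byte[] encode(int cp) {
        if (cp < 0x80) {                      // 7 bits: 0xxxxxxx
            return new byte[] {(byte) cp};
        } else if (cp < 0x800) {              // 11 bits: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        } else if (cp < 0x10000) {            // 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        } else {                              // 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
    }

    // Render bytes as lowercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x41)));   // 41
        System.out.println(hex(encode(0xE3)));   // c3 a3
        System.out.println(hex(encode(0x4F60))); // e4 bd a0
    }
}
```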
"你": the GB code is 0xC4E3, the Unicode value is 0x4F60.
We continue with the example above.
    Example 1: 0xC4E3 in binary:
        1 1 0 0 0 1 0 0   1 1 1 0 0 0 1 1
    There are only two bytes, so we try the two-byte pattern, but it does not fit:
    the second byte would have to start with 1 0, and its 7th bit is not 0. So "?" is returned.

    Example 2: 0x4F60 in binary:
        0 1 0 0 1 1 1 1   0 1 1 0 0 0 0 0
    Packing these bits into the 16-bit UTF-8 pattern gives:
        11100100 10111101 10100000
        E4       BD       A0
    so 0xE4, 0xBD, 0xA0 is returned.
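
Both cases can be checked with a strict UTF-8 decoder. This is a minimal sketch with a made-up class name, Utf8Check. Note one difference from the text: when decoding leniently, for example with new String(bytes, UTF_8), current JDKs substitute U+FFFD rather than "?" for malformed input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // throw instead of replacing
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] gbBytes = {(byte) 0xc4, (byte) 0xe3};                 // GB code of the character
        byte[] utf8Bytes = {(byte) 0xe4, (byte) 0xbd, (byte) 0xa0};  // UTF-8 form of U+4F60
        System.out.println(isValidUtf8(gbBytes));   // false
        System.out.println(isValidUtf8(utf8Bytes)); // true
    }
}
```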
3. String and byte[]
A String is at its core a char[]; to convert bytes into a String, however, an encoding must be applied.
String.length() is really the length of the char array, so if a different encoding is used the bytes may
be split at the wrong points, producing broken characters and garbled text.
Example:
    byte[] b = {(byte) '\u00c4', (byte) '\u00e3'};
    String str = new String(b, encoding);
    If encoding is 8859_1, str holds two characters; if encoding is gb2312, only one.
This problem often occurs when handling pagination.
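
A minimal sketch of both effects, assuming the GB2312 charset is available. The class name SplitDemo and the byte-based "paging" helper are hypothetical; they only illustrate what goes wrong when text is cut at a byte offset rather than a character boundary.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class SplitDemo {
    // Number of chars the bytes decode to under the given encoding.
    static int charCount(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName)).length();
    }

    // Naive paging: cut the byte array at a fixed byte offset, then decode the first page.
    static String firstPage(byte[] bytes, int pageBytes, String charsetName) {
        byte[] page = Arrays.copyOfRange(bytes, 0, Math.min(pageBytes, bytes.length));
        return new String(page, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] b = {(byte) 0xc4, (byte) 0xe3};
        System.out.println(charCount(b, "ISO-8859-1")); // 2
        System.out.println(charCount(b, "GB2312"));     // 1

        // "A" followed by the Chinese character: bytes 41 c4 e3 in GB2312.
        byte[] text = "A\u4f60".getBytes(Charset.forName("GB2312"));
        // A 2-byte page cuts the Chinese character in half, so it cannot be decoded intact.
        System.out.println(firstPage(text, 2, "GB2312").equals("A\u4f60")); // false
        System.out.println(firstPage(text, 3, "GB2312").equals("A\u4f60")); // true
    }
}
```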
4. Reader, Writer / InputStream, OutputStream
Reader and Writer operate on chars; InputStream and OutputStream operate on bytes.
The main purpose of Reader and Writer is to read chars from an InputStream and write chars to an OutputStream.
An example with a Reader:
The file test.txt contains only the character "你", stored as 0xc4,0xe3.
    String encoding = ...;   // gb2312 or 8859_1
    InputStreamReader reader = new InputStreamReader(
        new FileInputStream("test.txt"), encoding);
    char[] c = new char[10];
    int length = reader.read(c);
    for (int i = 0; i < length; i++)
        System.out.println(c[i]);
If encoding is gb2312, only one character is read; if encoding is 8859_1, two characters are read.
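
A self-contained sketch of the same experiment. To avoid depending on a real test.txt, it feeds the two bytes through a ByteArrayInputStream instead of a FileInputStream; that substitution and the class name ReaderDemo are mine.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;

class ReaderDemo {
    // Decode the bytes through an InputStreamReader and count the chars produced.
    static int readChars(byte[] fileBytes, String encoding) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(fileBytes), encoding)) {
            char[] c = new char[10];
            int total = 0;
            int n;
            while ((n = reader.read(c)) != -1) { // read until end of stream
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] content = {(byte) 0xc4, (byte) 0xe3}; // what test.txt would contain
        System.out.println(readChars(content, "GB2312"));     // 1
        System.out.println(readChars(content, "ISO-8859-1")); // 2
    }
}
```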

2. Some things to know about the Java compiler:
    javac -encoding
We often leave out the encoding parameter, yet it is very important for cross-platform work.
If no encoding is specified, the system default is used: gb2312 on a Chinese (GB) platform, iso8859_1 on an English platform.
The Java compiler actually invokes the sun.tools.javac.Main class to compile the file. This class
has an encoding variable in its compile function, and the -encoding parameter is passed directly to that variable.
The compiler reads the .java file according to this variable and then writes the strings into the class file in UTF-8 form.
An example:
    public void test() throws IOException
    {
        String str = "你";
        FileWriter write = new FileWriter("test.txt");
        write.write(str);
        write.close();
    }
    (Example 3)
If you compile with gb2312, you will find the bytes E4 BD A0 in the class file.

If you compile with 8859_1, the compiler sees the two characters
00C4 00E3, in binary:
    00000000 11000100   00000000 11100011
Because each character needs more than 7 bits, the 11-bit form is used:
    11000011 10000100   11000011 10100011
    C3       84         C3       A3
You will find C3 84 C3 A3 in the class file.
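
The two outcomes can be reproduced without running javac. The sketch below only mimics the two steps described above: decode the source bytes with the -encoding value, then store the string as UTF-8. The class name CompileSim is made up. Class files actually use a modified UTF-8, which gives the same bytes as standard UTF-8 for these characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CompileSim {
    // Step 1: read the source bytes with the given encoding.
    // Step 2: write the resulting string as UTF-8, as the class file would hold it.
    static String classFileBytes(byte[] sourceBytes, String sourceEncoding) {
        String literal = new String(sourceBytes, Charset.forName(sourceEncoding));
        return hex(literal.getBytes(StandardCharsets.UTF_8));
    }

    // Render bytes as lowercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] source = {(byte) 0xc4, (byte) 0xe3}; // the string literal as saved in a GB-encoded .java file
        System.out.println(classFileBytes(source, "GB2312"));     // e4 bd a0
        System.out.println(classFileBytes(source, "ISO-8859-1")); // c3 84 c3 a3
    }
}
```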

But we tend to ignore this parameter, so cross-platform problems keep appearing:
    Example 3 compiled on a Chinese platform produces ZhClass
    Example 3 compiled on an English platform produces EnClass
    1. ZhClass runs correctly on the Chinese platform, but not on the English platform
    2. EnClass runs correctly on the English platform, but not on the Chinese platform
Reason:
    1. After compiling on the Chinese platform, str at run time is the char[] 0x4f60.
    Run on the Chinese platform, FileWriter's default encoding is gb2312, so
    CharToByteConverter automatically calls the gb2312 converter to turn str
    into bytes and passes them to the FileOutputStream, so 0xc4,0xe3 goes into the file.
    On the English platform, however, the default CharToByteConverter is 8859_1;
    FileWriter automatically uses 8859_1 to convert str, but 8859_1 cannot represent the character, so
    "?" is output.
    2. After compiling on the English platform, str at run time is the char[] 0x00c4 0x00e3.
    Run on the Chinese platform, these characters cannot be converted by the Chinese converter, so "??" appears.
    On the English platform, 0x00c4 -> 0xc4 and 0x00e3 -> 0xe3, so 0xc4,0xe3 is written to the
    file.
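
The four combinations can be simulated on a single machine by passing the "platform default" encoding explicitly instead of relying on FileWriter's default. A minimal sketch; the class name PlatformMatrix and its constants are made up for illustration.

```java
import java.nio.charset.Charset;

class PlatformMatrix {
    static final String ZH_CLASS_STR = "\u4f60";       // str in ZhClass (compiled with gb2312)
    static final String EN_CLASS_STR = "\u00c4\u00e3"; // str in EnClass (compiled with 8859_1)

    // The bytes a writer would produce if defaultEncoding were the platform default.
    static String written(String str, String defaultEncoding) {
        byte[] bytes = str.getBytes(Charset.forName(defaultEncoding));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(written(ZH_CLASS_STR, "GB2312"));     // c4 e3  (correct)
        System.out.println(written(ZH_CLASS_STR, "ISO-8859-1")); // 3f     ("?")
        System.out.println(written(EN_CLASS_STR, "GB2312"));     // 3f 3f  ("??")
        System.out.println(written(EN_CLASS_STR, "ISO-8859-1")); // c4 e3  (correct)
    }
}
```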
1. How the JSP file itself is interpreted:
    Tomcat first checks whether your page contains a "<%@ page" directive. If it does, Tomcat sets
    response.setContentType(..) at that same place and reads the file using that encoding; if not, it reads
    the file as 8859_1. It then writes the .java file in UTF-8, reads that file with sun.tools.javac.Main
    (using UTF-8, of course), and compiles it into a class file.
    setContentType changes a property of out; the default encoding of the out variable is 8859_1.

2. How parameters are interpreted
Unfortunately, parameters are interpreted only as iso8859_1; this can be seen in the servlet implementation code.
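
The effect can be imitated with java.net.URLDecoder. This is only a simulation, not Tomcat's code: the class name ParamDecodeSim is made up, and it assumes the browser sent the character as the percent-encoded GB bytes %C4%E3.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class ParamDecodeSim {
    // Decode a percent-encoded query value using the named encoding.
    static String decodeParam(String raw, String encoding) {
        try {
            return URLDecoder.decode(raw, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "%C4%E3";
        // Interpreted as iso8859_1: two Latin-1 characters instead of one Chinese character.
        System.out.println(decodeParam(raw, "ISO-8859-1").length());                     // 2
        // Interpreted as gb2312: the intended character U+4F60.
        System.out.println(Integer.toHexString(decodeParam(raw, "GB2312").charAt(0)));  // 4f60
    }
}
```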

3. How include is interpreted
There is a format for this, but unfortunately whoever wrote org.apache.jasper.compiler.Parser
forgot to add one attribute, encoding, to the array JspUtil.ValidAttribute[], so this form
is not supported. You can recompile the source code yourself and add support for encoding.

Summary:

If you are on NT, the easiest way is to trick Java by not specifying any encoding at all:
Hello <%=request.getParameter("value")%>

http://localhost/test/test.jsp?value=你

Result: Hello 你

But this approach is very limited; when uploading the text of an article, for example, it fails completely. The best
solution is the following:
<%@ page contentType="text/html;charset=gb2312"%>
Hello <%=new String(request.getParameter("value").getBytes("8859_1"), "gb2312")%>
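
The re-decoding trick in the last line works outside JSP as well. A minimal sketch as plain Java, with a made-up class name Redecode: because iso8859_1 maps each byte 0x00-0xFF to exactly one char, encoding the wrongly decoded string back to 8859_1 recovers the original bytes, which can then be decoded with the real encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Redecode {
    // Undo a wrong iso8859_1 decode: recover the raw bytes, then decode them properly.
    static String fix(String misdecoded, String realEncoding) {
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, Charset.forName(realEncoding));
    }

    public static void main(String[] args) {
        String fromRequest = "\u00c4\u00e3"; // what getParameter returns for the bytes c4 e3
        String fixed = fix(fromRequest, "GB2312");
        System.out.println(Integer.toHexString(fixed.charAt(0))); // 4f60
    }
}
```

This only works when the wrong decode was iso8859_1 (or another encoding that loses no bytes); if "?" has already been substituted, the original bytes are gone.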

