Explanation of Chinese Character Handling in Java/JSP

Source: Internet
Author: User

Prerequisites:
1. Bytes and Unicode
Java is Unicode at its core, and so are class files. Most media, however, including files and streams, store data as byte streams.
Java therefore has to convert those bytes. A char is a Unicode unit, while a byte is just a byte.
In Java, byte/char conversion is handled by classes in the sun.io package. The ByteToCharConverter class does the dispatching in the middle,
and it can tell you which converter you are using. Its two most commonly used static functions are:
    public static ByteToCharConverter getDefault();
    public static ByteToCharConverter getConverter(String encoding);
If you do not specify a converter, the system automatically uses the current default encoding: GBK on a GB (Chinese) platform and 8859_1 on an English platform.

Let's take a simple example:
The GB code of "你" ("you") is 0xC4E3, and its Unicode value is 0x4F60.
You use:
    String encoding = "gb2312";
    byte[] b = {(byte) '\u00c4', (byte) '\u00e3'};
    ByteToCharConverter converter = ByteToCharConverter.getConverter(encoding);
    char[] c = converter.convertAll(b);
    for (int i = 0; i < c.length; i++)
    {
        System.out.println(Integer.toHexString(c[i]));
    }
This prints 0x4f60.
If the 8859_1 encoding is used instead, the output is
0x00c4, 0x00e3
---- Example 1
In reverse:
    String encoding = "gb2312";
    char[] c = {'\u4f60'};
    CharToByteConverter converter = CharToByteConverter.getConverter(encoding);
    byte[] b = converter.convertAll(c);
    for (int i = 0; i < b.length; i++)
    {
        // Mask with 0xff so that a negative byte prints as two hex digits.
        System.out.println(Integer.toHexString(b[i] & 0xff));
    }
The output is 0xc4, 0xe3.
---- Example 2
If 8859_1 is used, the output is 0x3f, that is "?", which indicates that the character cannot be converted.
Many Chinese-character problems derive from these two simplest classes. Yet many classes do not let you pass in an encoding directly,
which causes a lot of inconvenience. Many programs rarely name an encoding at all
and simply use the default one, which makes porting much harder.
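The sun.io converter classes are internal to old Sun JDKs and may not be available on a current JDK. As a minimal sketch, Examples 1 and 2 can be reproduced with only the public String and java.nio.charset API. The class name ByteCharDemo and its helper methods are made up for illustration, and the sketch assumes the GB2312 charset is installed in the JDK:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: public API stand-in for the sun.io converters.
final class ByteCharDemo {
    // Assumption: this JDK ships the GB2312 charset.
    static final Charset GB2312 = Charset.forName("GB2312");

    // Decode bytes with the given charset and list each resulting char as hex.
    static String decodeToHex(byte[] bytes, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (char ch : new String(bytes, cs).toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(Integer.toHexString(ch));
        }
        return sb.toString();
    }

    // Encode text with the given charset and list each resulting byte as hex.
    static String encodeToHex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(Integer.toHexString(b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] gb = {(byte) 0xC4, (byte) 0xE3};
        System.out.println(decodeToHex(gb, GB2312));                            // 4f60
        System.out.println(decodeToHex(gb, StandardCharsets.ISO_8859_1));       // c4 e3
        System.out.println(encodeToHex("\u4f60", GB2312));                      // c4 e3
        System.out.println(encodeToHex("\u4f60", StandardCharsets.ISO_8859_1)); // 3f
    }
}
```

Naming the charset explicitly in every call, as here, is what keeps the result independent of the platform default.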
2. UTF-8
UTF-8 corresponds one-to-one with Unicode, and its implementation is very simple:

    7-bit Unicode:  0_______
    11-bit Unicode: 110_____ 10______
    16-bit Unicode: 1110____ 10______ 10______
    21-bit Unicode: 11110___ 10______ 10______ 10______

In most cases only Unicode values of 16 bits or fewer are used.
We still use the example above: the GB code of "你" is 0xC4E3 and its Unicode value is 0x4F60.

Example 1: 0xC4E3 in binary:
    1100 0100 1110 0011
Since there are only two bytes, we try to read them as a two-byte encoding, but this does not work:
the second byte does not start with 10 (its 7th bit is not 0), so "?" is returned.

Example 2: 0x4F60 in binary:
    0100 1111 0110 0000
We fill these bits into the three-byte UTF-8 pattern, which gives:
    11100100 10111101 10100000
    E4       BD       A0
The result is 0xE4, 0xBD, 0xA0.
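The bit filling in the UTF-8 example can be sketched in a few lines. This is an illustration only: the class name Utf8ByHand is made up, and the sketch covers only values up to 16 bits and ignores surrogates, so the standard library should be used for real work.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch: hand-rolled UTF-8 encoding for values below 0x10000.
final class Utf8ByHand {
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            // 7 bits: 0_______
            return new byte[] {(byte) cp};
        } else if (cp < 0x800) {
            // 11 bits: 110_____ 10______
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        } else {
            // 16 bits: 1110____ 10______ 10______
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
    }

    public static void main(String[] args) {
        byte[] byHand = encode(0x4F60);
        for (byte b : byHand) {
            System.out.println(Integer.toHexString(b & 0xff)); // e4, bd, a0
        }
        // Cross-check against the library encoder.
        byte[] byLibrary = "\u4f60".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(byHand, byLibrary));  // true
    }
}
```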
3. String and byte[]
A String is really a char[] at its core, but bytes must go through an encoding before they can become a String.
String.length() is really the length of the char array, so if different encodings are used,
the bytes may be split at the wrong places, producing stray and garbled characters.
Example:
    byte[] b = {(byte) '\u00c4', (byte) '\u00e3'};
    String str = new String(b, encoding);
If encoding = 8859_1, str has two characters, but with encoding = gb2312 it has only one.
This problem occurs frequently when handling pagination.
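A minimal sketch of the length difference, and of why text should be cut on char boundaries after decoding rather than on byte boundaries. The class name LengthDemo and the two-character sample "你好" (GB bytes C4 E3 BA C3) are illustrative, and the GB2312 charset is assumed to be available in the JDK:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: the same bytes give different String lengths per encoding.
final class LengthDemo {
    // Assumption: this JDK ships the GB2312 charset.
    static final Charset GB2312 = Charset.forName("GB2312");

    static int charCount(byte[] bytes, Charset cs) {
        return new String(bytes, cs).length();
    }

    // Decode first, then cut on char boundaries, so no character is split in half.
    static String firstChars(byte[] bytes, Charset cs, int count) {
        String text = new String(bytes, cs);
        return text.substring(0, Math.min(count, text.length()));
    }

    public static void main(String[] args) {
        byte[] b = {(byte) 0xC4, (byte) 0xE3};
        System.out.println(charCount(b, StandardCharsets.ISO_8859_1)); // 2
        System.out.println(charCount(b, GB2312));                      // 1

        byte[] twoChars = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3};
        System.out.println(firstChars(twoChars, GB2312, 1).equals("\u4f60")); // true
    }
}
```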
4. Reader, Writer / InputStream, OutputStream
The core of Reader and Writer is char, and the core of InputStream and OutputStream is byte.
The main purpose of Reader and Writer, however, is to read chars from, and write chars to, an InputStream/OutputStream.
Example of a reader:
The file test.txt contains only the single character "你", 0xC4, 0xE3.
    String encoding = ...;
    InputStreamReader reader = new InputStreamReader(
        new FileInputStream("test.txt"), encoding);
    char[] c = new char[10];
    int length = reader.read(c);
    // Print only the chars actually read, not the whole buffer.
    for (int i = 0; i < length; i++)
        System.out.println(c[i]);
If encoding is gb2312, there is only one character. If encoding = 8859_1, there are two characters.
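The reader example can be tried without a file on disk. The following is a sketch under assumptions: the class name ReaderDemo is made up, an in-memory ByteArrayInputStream stands in for test.txt, and the JDK is assumed to support the "GB2312" charset name.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;

// Illustrative sketch: count the chars a Reader produces from the same bytes.
final class ReaderDemo {
    static int countChars(byte[] data, String encoding) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), encoding)) {
            char[] buf = new char[10];
            int total = 0;
            int n;
            // read() returns -1 at end of stream; sum up everything before that.
            while ((n = reader.read(buf)) != -1) {
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] fileContent = {(byte) 0xC4, (byte) 0xE3}; // stand-in for test.txt
        System.out.println(countChars(fileContent, "GB2312"));     // 1
        System.out.println(countChars(fileContent, "ISO-8859-1")); // 2
    }
}
```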
2. We need to understand the Java compiler:
    javac -encoding
We often do not use the encoding parameter, yet it is very important for cross-platform work.
If no encoding is specified, the system default is used: gb2312 on a GB platform and iso8859_1 on an English platform.
The Java compiler actually calls the sun.tools.javac.Main class to compile the file. This class
has an encoding variable in its compile function, and the -encoding parameter is passed directly to that variable.
The compiler reads the Java file according to this variable and then compiles it into a class file in UTF-8 form.
Example:
    public void test() throws IOException
    {
        String str = "你";
        FileWriter write = new FileWriter("test.txt");
        write.write(str);
        write.close();
    }
---- Example 3
If you compile with gb2312, you will find the bytes E4 BD A0 in the class file.

If you compile with 8859_1, the characters are
00C4 00E3, in binary:
    00000000 11000100 00000000 11100011
Because each character needs more than 7 bits, the 11-bit encoding is used:
    11000011 10000100 11000011 10100011
    C3       84       C3       A3
You will find C3 84 C3 A3 in the class file.
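What the compiler does to the string literal can be imitated in plain Java: decode the source bytes with the -encoding charset, then store the resulting chars as UTF-8. This is only a sketch. The class name CompileSim is made up, the GB2312 charset is assumed to be available, and class files really use a modified UTF-8 that gives the same bytes as standard UTF-8 for these particular characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: source bytes -> chars (per -encoding) -> UTF-8 bytes.
final class CompileSim {
    static String constantBytes(byte[] sourceBytes, Charset sourceEncoding) {
        // Step 1: the compiler reads the source file using the given encoding.
        String literal = new String(sourceBytes, sourceEncoding);
        // Step 2: the literal is stored in UTF-8 form.
        StringBuilder sb = new StringBuilder();
        for (byte b : literal.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(Integer.toHexString(b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] literalInSource = {(byte) 0xC4, (byte) 0xE3}; // "你" saved as GB bytes
        System.out.println(constantBytes(literalInSource, Charset.forName("GB2312")));  // e4 bd a0
        System.out.println(constantBytes(literalInSource, StandardCharsets.ISO_8859_1)); // c3 84 c3 a3
    }
}
```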

However, we often ignore this parameter, so cross-platform problems often arise:
    Compile Example 3 on the Chinese platform to produce ZhClass
    Compile Example 3 on the English platform to produce EnClass
    1. ZhClass runs OK on the Chinese platform, but not on the English platform.
    2. EnClass runs OK on the English platform, but not on the Chinese platform.
Cause:
    1. After compiling on the Chinese platform, the char[] of str at run time is 0x4f60.
       When run on the Chinese platform, the default FileWriter encoding is gb2312, so
       CharToByteConverter automatically calls the gb2312 converter to convert str
       into bytes for the FileOutputStream, and 0xc4 and 0xe3 are put into the file.
       On the English platform, however, the default CharToByteConverter is 8859_1.
       FileWriter automatically uses 8859_1 to convert str, but 8859_1 cannot represent the character, so it
       outputs "?".
    2. After compiling on the English platform, the char[] of str at run time is 0x00c4 0x00e3.
       When run on the Chinese platform, these are not recognized as Chinese characters, so "??" appears.
       On the English platform, 0x00c4 --> 0xc4 and 0x00e3 --> 0xe3, so 0xc4 and 0xe3 are put
       into the file.
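One way to take the platform default out of the picture is to name the encoding when creating the writer. This is a sketch of that idea, not the author's code: the class name ExplicitWriter is made up, a ByteArrayOutputStream stands in for the file, and the GB2312 charset is assumed to be available.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: write text with an explicitly named encoding.
final class ExplicitWriter {
    static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // OutputStreamWriter with a charset does not depend on the platform default.
        try (Writer writer = new OutputStreamWriter(out, cs)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(Integer.toHexString(b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same result on a Chinese or an English platform:
        System.out.println(hex(write("\u4f60", Charset.forName("GB2312"))));  // c4 e3
        // What an 8859_1 default does to the same string:
        System.out.println(hex(write("\u4f60", StandardCharsets.ISO_8859_1))); // 3f
    }
}
```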
1. How the JSP text is interpreted:
Tomcat first checks whether your page contains a "<%@ page" directive. If it does, response.setContentType(...)
is set at that same place and the file is read according to that encoding. If it does not, the file is read
as 8859_1. Tomcat then writes the .java file in UTF-8 and has sun.tools.Main read that file
(using UTF-8, of course) and compile it into a class file.
setContentType changes the properties of out. The default encoding of the out variable is 8859_1.

2. How parameters are interpreted
Unfortunately, parameters are interpreted only as iso8859_1. This can be confirmed in the servlet implementation code.

3. How include is interpreted
As for the include format: unfortunately, the person who wrote "org.apache.jasper.compiler.Parser"
did not add the "encoding" parameter to the array JspUtil.ValidAttribute[], so specifying an encoding
this way is not supported. You can modify and recompile the source code to add encoding support.

Summary:

If you are on NT, the simplest method is to trick Java by not adding any encoding variable:
<html>
Hello <%= request.getParameter("value") %>
</html>

http://localhost/test.jsp?value=你

Result: Hello 你

However, this method is quite limited: for tasks such as splitting an article into sections it is a dead end. It is best
to use this solution:
<%@ page contentType="text/html; charset=gb2312" %>
<html>
Hello <%= new String(request.getParameter("value").getBytes("8859_1"), "gb2312") %>
</html>
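The getBytes("8859_1") step works because 8859_1 maps every byte value to exactly one char and back, so the original bytes can be recovered without loss and then decoded with the right encoding. A minimal sketch outside any servlet container follows. The class name Redecode is made up, the string literal simulates what request.getParameter would return, and the GB2312 charset is assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: undo an 8859_1 mis-decoding of GB-encoded bytes.
final class Redecode {
    static String fix(String misdecoded, Charset realEncoding) {
        // Step 1: chars 0x00..0xff map back to the original bytes one-to-one.
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Step 2: decode those bytes with the encoding the sender really used.
        return new String(original, realEncoding);
    }

    public static void main(String[] args) {
        // Simulated parameter: bytes C4 E3 that were decoded as 8859_1.
        String fromContainer = "\u00c4\u00e3";
        String fixed = fix(fromContainer, Charset.forName("GB2312"));
        System.out.println(Integer.toHexString(fixed.charAt(0))); // 4f60
    }
}
```

The trick only holds while the container really decoded the bytes as 8859_1. Applying it to a string that was already decoded correctly would corrupt it.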

A must-read article, but I cannot endorse the solution

--------------------------------------------------------------------------------

1. The GET method is not recommended for passing web page parameters, and whether the browser sends them as UTF-8 can be adjusted.
2. I recommend not using the charset=gb2312 page directive in JSP. In fact, adding that line and leaving it out are each a way to get Chinese characters displayed correctly. I find leaving it out more convenient, since at least that line of code does not have to be written. I believe the following configuration displays Chinese properly:
A. Compile all JavaBeans with iso8859-1
B. Do not write the charset=gb2312 statement above in the JSP file (if it is written, the output is wrong)

With Tomcat, paying attention to these two points is enough. For other JSP servers where that may not work, add the following:
C. Set the operating system language on the server to English (for example, a Linux that is not a Chinese distribution such as Bluepoint is already in English)
and that is all.

If it does not work, please report back.

Re: A must-read article, but I cannot endorse the solution

--------------------------------------------------------------------------------

Tomcat decodes parameters as 8859_1 for both the GET and POST methods. The Tomcat servlet source code is as follows:
A) For the POST method
The parsePostData method of javax.servlet.http.HttpUtils (for POST form data):
    String postedBody = new String(postedBytes, 0, len, "8859_1");
This line by itself is not a problem, because Chinese characters are sent in %XX form. However, the parseName function does not reassemble the Chinese characters. It simply appends the decoded values one by one, so it can be determined that it follows the 8859_1 encoding rule:
    sb.append((char) Integer.parseInt(s.substring(i + 1, i + 3), 16));
    i += 2;
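The effect of appending each %XX value as its own char can be seen in a small sketch. The decodeAsChars method below only imitates the loop quoted above and is not Tomcat's actual code. The class name PercentDecode is made up, and the comparison uses the standard java.net.URLDecoder with an assumed "GB2312" charset name.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Illustrative sketch: %XX decoding per char versus per byte sequence.
final class PercentDecode {
    // Imitation of the quoted loop: each %XX becomes one char (8859_1 behaviour).
    static String decodeAsChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%') {
                sb.append((char) Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else if (c == '+') {
                sb.append(' ');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Library decoding: consecutive %XX bytes are decoded together with the charset.
    static String decodeWithCharset(String s, String encoding) {
        try {
            return URLDecoder.decode(s, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String value = "%C4%E3";
        System.out.println(decodeAsChars(value).length());               // 2
        System.out.println(decodeWithCharset(value, "GB2312").length()); // 1
    }
}
```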
B) For the GET method
org.apache.tomcat.service.http.HttpRequestAdapter:
    line = new String(buf, 0, count,
        Constants.CharacterEncoding.Default);
    Constants.CharacterEncoding.Default = 8859_1
This code is hard to trace, and you should not be misled by appearances. HttpRequestAdapter is derived from RequestImpl. In practice, however, the server on port 8080 does not use RequestImpl directly; it uses HttpRequestAdapter to obtain the query string.

I reserve my opinion on the approach of not adding an encoding, because if you want to solve problems such as file upload and pagination, you have to deal with the encoding. In addition, specifying the encoding ensures correct transfer within some beans.

Let me explain here.

--------------------------------------------------------------------------------

Tomcat is just a standard implementation of JSP 1.1 and Servlet 2.2. We should not demand that this free software be thorough in every detail and in performance. It mainly considers English-speaking users, which is also why passing Chinese characters in a URL without special conversion causes problems. In most of our IE browsers, the advanced option "always send URLs as UTF-8" is on by default. If you call this a Tomcat bug, that is also fair. In addition, whatever the language of the current operating system, Tomcat seems to compile JSP as iso8859. I think this is something of a defect, but in any case, implementations of new standards and popular software always consider English first when it comes to language support.

What is there to say for my solution?
1. As I said, software from English-speaking countries always considers English first. The Java Virtual Machine specification requires a virtual machine to implement only three encodings: iso8859, Unicode, and UTF-8. Nothing else is required, and that is for virtual machines like the one in the JDK; embedded ones need not even do that. In other words, other encodings may not be supported directly by the Java Virtual Machine, and our Chinese encodings are naturally not among them. They are converted by an external package, which in the Sun JDK should be in i18n.jar. So iso8859 is the fastest, and no I/O is needed to read that package.
2. At the least there is less code to write, no additional operations, and a concise style.
3. I wrote a chat-room application in JSP + JavaBeans (no servlets; JSP really is good). With the same program, Americans see an English interface in their browsers and Chinese users get a Chinese interface. Adding charset=gb2312 would at least be troublesome.
4. gb2312 is limited. What if you want to use GBK? It is better not to fix a character set at all: whatever the character set is, as long as it matches what I have set in the current browser, it displays correctly.

Summary: In terms of speed, development efficiency, and scalability alike, my solution is better than yours. Moreover, I cannot find a better solution than mine.
