From the http://www.javaresearch.org website. Original author: owen1944.
This is a summary of some solutions and experiences for dealing with garbled Chinese text.
http://www.javaresearch.org/article/showarticle.jsp?column=331&thread=8065&start=15&msrange=15
1. Bytes and Unicode
Java is Unicode internally, and so are class files. Many media, however, including files and streams, store data as bytes, so Java has to convert these byte streams. A char is a Unicode code unit, while a byte is a raw byte. In Java, the byte/char conversion functions are in the sun.io package. The ByteToCharConverter class acts as the dispatcher there and can be used to find out which converter you are using. Its two most commonly used static functions are:
public static ByteToCharConverter getDefault();
public static ByteToCharConverter getConverter(String encoding);
If you do not specify a converter, the system automatically uses the current platform encoding: GBK on a Chinese (GB) platform and 8859_1 on an English platform.
byte --> char:
The GB code of "你" ("you") is 0xc4e3; its Unicode value is 0x4f60.
String encoding = "gb2312";
byte b[] = {(byte) '\u00c4', (byte) '\u00e3'};
ByteToCharConverter converter = ByteToCharConverter.getConverter(encoding);
char c[] = converter.convertAll(b);
for (int i = 0; i < c.length; i++) {
    System.out.println(Integer.toHexString(c[i]));
}
What is the result? 0x4f60
If encoding = "8859_1", what is the result? 0x00c4, 0x00e3
If the code is changed to:
byte b[] = {(byte) '\u00c4', (byte) '\u00e3'};
ByteToCharConverter converter = ByteToCharConverter.getDefault();
char c[] = converter.convertAll(b);
for (int i = 0; i < c.length; i++) {
    System.out.println(Integer.toHexString(c[i]));
}
What is the result? It depends on the platform encoding.
char --> byte:
String encoding = "gb2312";
char c[] = {'\u4f60'};
CharToByteConverter converter = CharToByteConverter.getConverter(encoding);
byte b[] = converter.convertAll(c);
for (int i = 0; i < b.length; i++) {
    // mask with 0xff so that negative bytes are not sign-extended
    System.out.println(Integer.toHexString(b[i] & 0xff));
}
What is the result? 0xc4, 0xe3
If encoding = "8859_1", what is the result? 0x3f
If the code is changed to:
String encoding = "gb2312";
char c[] = {'\u4f60'};
CharToByteConverter converter = CharToByteConverter.getDefault();
byte b[] = converter.convertAll(c);
for (int i = 0; i < b.length; i++) {
    System.out.println(Integer.toHexString(b[i] & 0xff));
}
What is the result? It depends on the platform encoding.
Many Chinese-text problems stem from these two simple classes. Many classes do not accept an encoding argument directly, which is inconvenient, and many programs never specify an encoding and rely on the default, which makes them hard to port.
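The sun.io converters are internal, unsupported JDK classes and are absent from recent JDK releases. As a rough sketch of the same two conversions with the public java.nio.charset API (the class and method names below are made up for illustration, and the code assumes the runtime ships the GB2312 charset):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConverterSketch {
    // byte --> char with an explicit charset instead of the platform default
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // char --> byte with an explicit charset
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Hex dump; the 0xff mask avoids sign extension of negative bytes
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset gb = Charset.forName("GB2312"); // assumption: this charset is installed
        byte[] raw = {(byte) 0xc4, (byte) 0xe3};
        System.out.println(Integer.toHexString(decode(raw, gb).charAt(0)));
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1).length());
        System.out.println(hex(encode("\u4f60", gb)));
        System.out.println(hex(encode("\u4f60", StandardCharsets.ISO_8859_1)));
    }
}
```

Passing the charset explicitly on every conversion is what removes the dependence on the platform default described above.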
2. UTF-8
UTF-8 corresponds one-to-one with Unicode, and its implementation is very simple:
7-bit Unicode: 0_______
11-bit Unicode: 110_____ 10______
16-bit Unicode: 1110____ 10______ 10______
21-bit Unicode: 11110___ 10______ 10______ 10______
In most cases only Unicode values of 16 bits or fewer are used:
The GB code of "你" ("you") is 0xc4e3; its Unicode value is 0x4f60.
0xc4e3 in binary:
1100 0100, 1110 0011
Since there are only two bytes, we try to read them as a two-byte UTF-8 sequence, but this does not work: the second byte should have the form 10______, yet its 7th bit is not 0, so "?" is returned.
0x4f60 in binary:
0100 1111, 0110 0000
We fill these bits into the 16-bit (three-byte) UTF-8 template:
1110 0100, 1011 1101, 1010 0000
E4 -- BD -- A0
The result is 0xe4, 0xbd, 0xa0.
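The bit-filling step above can be sketched in a few lines. This is only an illustration (the class name Utf8Sketch is made up, and the code takes a code point as an int and does no validation, for example of surrogate values); in real code String.getBytes with UTF-8 does the same job.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Sketch {
    // Encodes one code point (0..0x10FFFF) by filling the UTF-8 bit templates.
    static byte[] encode(int cp) {
        if (cp < 0x80) {                       // 7 bits:  0_______
            return new byte[] {(byte) cp};
        } else if (cp < 0x800) {               // 11 bits: 110_____ 10______
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        } else if (cp < 0x10000) {             // 16 bits: 1110____ 10______ 10______
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        } else {                               // 21 bits: 11110___ and three 10______
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
    }

    public static void main(String[] args) {
        for (byte b : encode(0x4f60)) {
            System.out.printf("%02x ", b & 0xff);   // prints: e4 bd a0
        }
        System.out.println();
        // Cross-check against the JDK's own encoder
        System.out.println(Arrays.equals(encode(0x4f60),
                "\u4f60".getBytes(StandardCharsets.UTF_8)));
    }
}
```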
3. String and byte[]
The core of a String is a char[]. Converting bytes into a String always involves an encoding. String.length() is the length of that char array, so if the wrong encoding is used, the bytes are likely to be split incorrectly, producing garbled characters.
For example:
String encoding = "";
byte[] b = {(byte) '\u00c4', (byte) '\u00e3'};
String str = new String(b, encoding);
If encoding = 8859_1, the string contains two characters, but with encoding = gb2312 it contains only one. This problem often shows up when paginating text.
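A small sketch of that length difference, using the Charset overload of the String constructor. The class name LengthSketch is illustrative, and the code assumes the GB2312 charset is available on the runtime:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LengthSketch {
    // Number of chars produced when the same bytes are decoded with a given charset
    static int charCount(byte[] bytes, Charset charset) {
        return new String(bytes, charset).length();
    }

    public static void main(String[] args) {
        byte[] b = {(byte) 0xc4, (byte) 0xe3};
        // One char per byte under ISO-8859-1
        System.out.println(charCount(b, StandardCharsets.ISO_8859_1));
        // One char for the byte pair under GB2312 (assumes the charset is installed)
        System.out.println(charCount(b, Charset.forName("GB2312")));
    }
}
```

Any logic that cuts text by character count, such as pagination, therefore has to decode with the right charset first and only then count.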
4. Reader, Writer / InputStream, OutputStream
Reader and Writer work on chars, while InputStream and OutputStream work on bytes. The main purpose of Reader and Writer is to read chars from an InputStream and write chars to an OutputStream.
For example:
The file text.txt contains only the single character "你" ("you"), i.e. the bytes 0xc4, 0xe3.
String encoding = "gb2312";
InputStreamReader reader = new InputStreamReader(new FileInputStream(
        "text.txt"), encoding);
char c[] = new char[10];
int length = reader.read(c);
for (int i = 0; i < length; i++) {
    System.out.println(c[i]);
}
What is the result? 你
If encoding = "8859_1", what is the result? "??": two characters, meaning the text was not recognized.
The reverse direction is left as an exercise.
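For the reverse direction, one possible sketch writes through an OutputStreamWriter into an in-memory stream so the resulting bytes can be inspected. The class name WriterSketch and its helper are illustrative, not part of the original article, and GB2312 is assumed to be available:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WriterSketch {
    // char --> byte: writes text through a Writer and returns the bytes produced
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();   // the writer is closed (and flushed) by now
    }

    public static void main(String[] args) {
        // Expect c4 e3 with GB2312 (assumes the charset is installed)
        for (byte b : write("\u4f60", Charset.forName("GB2312"))) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println();
        // ISO-8859-1 cannot represent the character, so the replacement '?' (3f) is written
        for (byte b : write("\u4f60", StandardCharsets.ISO_8859_1)) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println();
    }
}
```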
5. We need to understand the Java compiler:
javac -encoding
We often omit the encoding parameter, but it is very important for cross-platform work. If no encoding is specified, the system default is used: gb2312 on a Chinese (GB) platform and iso8859_1 on an English platform.
The Java compiler actually calls the sun.tools.javac.Main class to compile files. Its compile function has an encoding variable, and the -encoding parameter is passed directly to that variable. The compiler reads the Java source file according to this variable and then writes the class file with its strings in UTF-8 format.
Sample code:
String str = "你";
FileWriter writer = new FileWriter("text.txt");
writer.write(str);
writer.close();
If you compile with gb2312, you will find the bytes E4 BD A0 in the class file;
if you compile with 8859_1, the source is read as the two characters 00c4 00e3, in binary:
0000 0000 1100 0100, 0000 0000 1110 0011
Because each character needs more than 7 bits, the 11-bit encoding is used:
C3 -- 84 -- C3 -- A3
You will find C3 84 C3 A3 in the class file.
However, we often omit this parameter, which frequently causes cross-platform problems:
The sample code compiled on the Chinese platform produces ZhClass.
The sample code compiled on the English platform produces EnClass.
(1) ZhClass runs correctly on the Chinese platform, but not on the English platform.
(2) EnClass runs correctly on the English platform, but not on the Chinese platform.
Cause:
(1) After compiling on the Chinese platform, str at run time is the char[] 0x4f60. When run on the Chinese platform, the default encoding of FileWriter is gb2312, so CharToByteConverter automatically uses the gb2312 converter to turn str into bytes and passes them to FileOutputStream; 0xc4 and 0xe3 are therefore written to the file.
On the English platform, however, the default CharToByteConverter is 8859_1, so FileWriter automatically uses 8859_1 to convert str. That encoding cannot represent the character, so "?" is output.
(2) After compiling on the English platform, str at run time is the char[] 0x00c4 0x00e3. When run on the Chinese platform, these characters cannot be converted, so "??" is output;
on the English platform, 0x00c4 --> 0xc4 and 0x00e3 --> 0xe3, so 0xc4 and 0xe3 are written to the file.
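The compile-time and run-time steps can be simulated without javac. The sketch below is not how the compiler is implemented; it only mimics "read the source bytes with the -encoding charset" and "write with the platform default charset" using plain String conversions. The class and method names are made up, and GB2312 is assumed to be available:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CompileSketch {
    // Compile time: the chars the compiler sees when it reads source bytes
    static String readSource(byte[] sourceBytes, Charset compilerEncoding) {
        return new String(sourceBytes, compilerEncoding);
    }

    // Run time: the bytes a writer produces for a given platform default charset
    static byte[] writeWithDefault(String str, Charset platformDefault) {
        return str.getBytes(platformDefault);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] source = {(byte) 0xc4, (byte) 0xe3};   // the character as GB bytes in the .java file
        String zh = readSource(source, Charset.forName("GB2312"));   // assumption: charset installed
        String en = readSource(source, StandardCharsets.ISO_8859_1);

        // Strings as stored in the class file (UTF-8)
        System.out.println(hex(zh.getBytes(StandardCharsets.UTF_8)));  // e4bda0
        System.out.println(hex(en.getBytes(StandardCharsets.UTF_8)));  // c384c3a3

        // EnClass on the English platform: the original bytes come back by accident
        System.out.println(hex(writeWithDefault(en, StandardCharsets.ISO_8859_1)));  // c4e3
        // ZhClass on the English platform: the character is unmappable
        System.out.println(hex(writeWithDefault(zh, StandardCharsets.ISO_8859_1)));  // 3f
    }
}
```

Compiling with an explicit -encoding and writing through an OutputStreamWriter with an explicit charset removes both platform dependencies.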
6. Other causes: <%@ page contentType="text/html; charset=GBK" %>
This directive sets the encoding the browser uses to display the page. If the response data is actually UTF-8 encoded, the page will be garbled, but the garbled text looks different from the cases above.
7. Where encoding conversion happens:
From database to Java program: byte --> char
From Java program to database: char --> byte
From file to Java program: byte --> char
From Java program to file: char --> byte
From Java program to page display: char --> byte
From page form submission to Java program: byte --> char
From stream to Java program: byte --> char
From Java program to stream: char --> byte
Xie Zhigang's solution:
I solved the garbled Chinese characters by configuring a filter:
<web-app>
  <filter>
    <filter-name>RequestFilter</filter-name>
    <filter-class>net.golden.uirs.util.RequestFilter</filter-class>
    <init-param>
      <param-name>charset</param-name>
      <param-value>gb2312</param-value>
    </init-param>
  </filter>
  <filter-mapping>
    <filter-name>RequestFilter</filter-name>
    <url-pattern>*.jsp</url-pattern>
  </filter-mapping>
</web-app>
public void doFilter(ServletRequest req, ServletResponse res,
        FilterChain fchain) throws IOException, ServletException {
    HttpServletRequest request = (HttpServletRequest) req;
    HttpServletResponse response = (HttpServletResponse) res;
    HttpSession session = request.getSession();
    String userid = (String) session.getAttribute("userid");
    // Set the character set; this actually sets the encoding used for byte --> char.
    req.setCharacterEncoding(this.filterConfig.getInitParameter("charset"));
    try {
        if (userid == null || userid.equals("")) {
            if (!request.getRequestURL().toString().matches(".*/uirs/logon(Controller){0,1}\\x2Ejsp$")) {
                session.invalidate();
                response.sendRedirect(request.getContextPath() + "/uirs/logon.jsp");
            }
        }
        else { // check whether the user has permission for the information reporting system
            if (!net.golden.uirs.util.UirsChecker.check(userid, "information reporting system",
                    net.golden.uirs.util.UirsChecker.ACTION_DO)) {
                if (!request.getRequestURL().toString().matches(".*/uirs/logon(Controller){0,1}\\x2Ejsp$")) {
                    response.sendRedirect(request.getContextPath() + "/uirs/logon/logoncontroller.jsp");
                }
            }
        }
    }
    catch (Exception ex) {
        response.sendRedirect(request.getContextPath() + "/uirs/logon.jsp");
    }
    fchain.doFilter(req, res);
}