Java: solutions and notes on garbled Chinese text

Source: Internet
Author: User

In Java programs, Chinese characters are frequently displayed incorrectly, as runs of garbled characters or question marks.
This happens because Java represents text internally as Unicode, while the files and databases commonly used by Chinese speakers are encoded in GB2312 or Big5. This problem used to trouble me often. After some research I finally solved it, and since many others are likely to run into the same thing, I have summarized what I learned here to share.

1. Output Chinese characters on a web page.
The encoding Java uses for network transmission is "ISO-8859-1", so a string has to be converted before it is output, for example:
String str = "中文";
str = new String(str.getBytes("gb2312"), "8859_1");
However, if the program was compiled with the "gb2312" encoding and runs on a Chinese-language platform, this problem does not occur and the conversion should not be applied, so check which case you are in.

2. Read Chinese from request parameters
This is exactly the reverse of the web page output case:
str = new String(str.getBytes("8859_1"), "gb2312");
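
The two conversions in sections 1 and 2 are inverses of each other. The following is a minimal sketch of that round trip, not code from the original article: the class and method names are made up for illustration, and it assumes the runtime ships the GB2312 charset (standard JDK builds do). The character 你 is written as \u4f60 so the example does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {
    // Assumption: the GB2312 charset is available in this runtime.
    static final Charset GB2312 = Charset.forName("GB2312");

    // Section 1: pack the GB2312 bytes of a string into chars 0x00-0xff, one char per byte.
    static String toLatin1Carrier(String s) {
        return new String(s.getBytes(GB2312), StandardCharsets.ISO_8859_1);
    }

    // Section 2: recover the raw bytes from the carrier string and decode them as GB2312.
    static String fromLatin1Carrier(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), GB2312);
    }

    public static void main(String[] args) {
        String original = "\u4f60";
        String carrier = toLatin1Carrier(original);
        System.out.println(carrier.length());                              // 2
        System.out.println(original.equals(fromLatin1Carrier(carrier)));   // true
    }
}
```

The intermediate string is only a container for bytes. Printing it directly shows two Latin-1 characters, which is exactly the kind of garbled text described above.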

3. Handling Chinese in database operations
A simple approach is to set "Region" to "English (USA)" in the Control Panel. If garbled characters still appear, apply the following conversions:
When retrieving Chinese characters: str = new String(str.getBytes("gb2312"));
When writing Chinese characters to the DB: str = new String(str.getBytes("ISO-8859-1"));

4. Solutions for Chinese in JSP:

In the Control Panel, set "Region" to "English (USA)".
Add:
If that does not work, perform the following conversion:
For example: name = new String(name.getBytes("ISO-8859-1"), "GBK");
After that, Chinese text should display correctly.

------------------------------------------------------

I. Bytes and Unicode

Java is Unicode at its core, and even class files are. However, many media, including files and streams, store data as bytes, so Java has to convert between those bytes and characters. A char is Unicode, while a byte is a raw byte. The functions that convert between byte and char live in the sun.io package. There, the ByteToCharConverter class acts as the dispatcher and can tell you which converter is in use. Its two most commonly used static functions are:

public static ByteToCharConverter getDefault();
public static ByteToCharConverter getConverter(String encoding);

If you do not specify a converter, the system automatically uses the current platform encoding: GBK on a GB (Simplified Chinese) platform and 8859_1 on an English platform.

byte --> char:

The GB code of "你" ("you") is 0xc4e3, and its Unicode value is 0x4f60.

String encoding = "gb2312";
byte[] b = {(byte) '\u00c4', (byte) '\u00e3'};
ByteToCharConverter converter = ByteToCharConverter.getConverter(encoding);
char[] c = converter.convertAll(b);
for (int i = 0; i < c.length; i++) {
    System.out.println(Integer.toHexString(c[i]));
}

What is the result? 0x4f60

If encoding = "8859_1", what is the result? 0x00c4, 0x00e3 (this shows that 8859_1 maps each byte directly to one char)
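
The sun.io converters are internal classes and are not available in current JDKs. As a rough equivalent, the same byte --> char experiment can be sketched with the public java.nio.charset API. The class and method names below are illustrative only, and the sketch assumes the GB2312 charset is installed.

```java
import java.nio.charset.Charset;

public class DecodeBytes {
    // Decode raw bytes with the named charset and return each resulting char as hex.
    static String decodeToHex(byte[] bytes, String charsetName) {
        String s = new String(bytes, Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] b = {(byte) 0xc4, (byte) 0xe3};
        System.out.println(decodeToHex(b, "GB2312"));      // 4f60
        System.out.println(decodeToHex(b, "ISO-8859-1"));  // c4 e3
    }
}
```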

If the code is changed to use the default converter:

byte[] b = {(byte) '\u00c4', (byte) '\u00e3'};
ByteToCharConverter converter = ByteToCharConverter.getDefault();
char[] c = converter.convertAll(b);
for (int i = 0; i < c.length; i++) {
    System.out.println(Integer.toHexString(c[i]));
}

What is the result?

It depends on the platform encoding.
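
To see which encoding a given machine falls back to, the public API offers Charset.defaultCharset(). The sketch below is only an illustration with a made-up class name. To my knowledge, JDK 18 and later default to UTF-8 unless configured otherwise, so the output differs across both platforms and JDK versions.

```java
import java.nio.charset.Charset;

public class PlatformDefault {
    // Name of the charset the JVM uses when no encoding is given explicitly.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + defaultCharsetName());
        // Decoding without naming a charset: the chars printed vary by platform.
        byte[] b = {(byte) 0xc4, (byte) 0xe3};
        String s = new String(b);
        for (int i = 0; i < s.length(); i++) {
            System.out.println(Integer.toHexString(s.charAt(i)));
        }
    }
}
```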

char --> byte:

String encoding = "gb2312";
char[] c = {'\u4f60'};
CharToByteConverter converter = CharToByteConverter.getConverter(encoding);
byte[] b = converter.convertAll(c);
for (int i = 0; i < b.length; i++) {
    // Mask with 0xff so that a negative byte prints as c4 rather than ffffffc4.
    System.out.println(Integer.toHexString(b[i] & 0xff));
}

What is the result? 0xc4, 0xe3

If encoding = "8859_1", what is the result? 0x3f (the "?" character, because 8859_1 cannot represent this character)
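
The char --> byte direction can be sketched with String.getBytes(Charset) from the public API in the same way. As before, the names are illustrative and GB2312 is assumed to be installed. String.getBytes substitutes the charset's replacement byte for characters it cannot encode, which is where the 0x3f comes from.

```java
import java.nio.charset.Charset;

public class EncodeChars {
    // Encode a string with the named charset and return each byte as hex.
    static String encodeToHex(String s, String charsetName) {
        byte[] bytes = s.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            // Mask with 0xff so negative bytes print as two hex digits.
            sb.append(Integer.toHexString(bytes[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeToHex("\u4f60", "GB2312"));      // c4 e3
        System.out.println(encodeToHex("\u4f60", "ISO-8859-1"));  // 3f
    }
}
```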

If the code is changed to use the default converter:

char[] c = {'\u4f60'};
CharToByteConverter converter = CharToByteConverter.getDefault();
byte[] b = converter.convertAll(c);
for (int i = 0; i < b.length; i++) {
    System.out.println(Integer.toHexString(b[i] & 0xff));
}

What is the result? It depends on the platform encoding.

Many Chinese-text problems stem from these two simple classes. However, many classes do not accept an encoding argument directly, which is a real inconvenience. Many programs never specify an encoding and simply rely on the platform default, which makes them hard to port.

II. UTF-8

UTF-8 corresponds one-to-one with Unicode, and its implementation is simple (each _ stands for one bit of the Unicode value):

7-bit Unicode: 0_______

11-bit Unicode: 110_____ 10______

16-bit Unicode: 1110____ 10______ 10______

21-bit Unicode: 11110___ 10______ 10______ 10______

In most cases, only Unicode values of 16 bits or fewer are used:

The GB code of "你" ("you") is 0xc4e3, and its Unicode value is 0x4f60.

0xc4e3 in binary:

1100 0100, 1110 0011

Since there are only two bytes, we try to read them as a two-byte UTF-8 sequence, but this does not work: the second byte would have to start with 10, and its second-highest bit is 1 rather than 0, so "?" is returned.

0x4f60 in binary:

0100 1111, 0110 0000

We fill these bits into the three-byte UTF-8 pattern:

1110 0100, 1011 1101, 1010 0000

E4 -- BD -- A0

The result is 0xe4, 0xbd, 0xa0.
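
The bit patterns above translate almost directly into shifts and masks. This is a minimal sketch for values below 0x10000 only (it ignores the four-byte form and surrogates), with a made-up class name, compared against the JDK's own UTF-8 encoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8ByHand {
    // Encode one Unicode value below 0x10000 using the 1-, 2- and 3-byte patterns.
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            return new byte[] {(byte) cp};
        } else if (cp < 0x800) {
            return new byte[] {
                (byte) (0xc0 | (cp >> 6)),           // 110_____ : top 5 bits
                (byte) (0x80 | (cp & 0x3f))};        // 10______ : low 6 bits
        } else {
            return new byte[] {
                (byte) (0xe0 | (cp >> 12)),          // 1110____ : top 4 bits
                (byte) (0x80 | ((cp >> 6) & 0x3f)),  // 10______ : middle 6 bits
                (byte) (0x80 | (cp & 0x3f))};        // 10______ : low 6 bits
        }
    }

    public static void main(String[] args) {
        byte[] byHand = encode(0x4f60);
        byte[] byLibrary = "\u4f60".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(byHand, byLibrary)); // true
    }
}
```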

III. String and byte[]

The core of a String is actually a char[]. Converting bytes to a String always involves an encoding. String.length() is really the length of the char array, so if a different encoding is used, the bytes may be split into characters incorrectly, producing garbled text. For example:

String encoding = ""; // set to "8859_1" or "gb2312"
byte[] b = {(byte) '\u00c4', (byte) '\u00e3'};
String str = new String(b, encoding);

With encoding = 8859_1 the result is two characters, but with encoding = gb2312 it is only one. This problem often comes up when splitting text into pages.
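
A short sketch of that length difference, and of what can go wrong when text is cut at a byte offset, follows. The class and method names are illustrative, and GB2312 is assumed to be installed. The second character used is 好 (U+597D).

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class LengthByCharset {
    // Number of chars produced when the bytes are decoded with the named charset.
    static int charCount(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName)).length();
    }

    public static void main(String[] args) {
        byte[] one = {(byte) 0xc4, (byte) 0xe3};
        System.out.println(charCount(one, "ISO-8859-1")); // 2
        System.out.println(charCount(one, "GB2312"));     // 1

        // Four GB2312 bytes for two characters; keeping only three splits the second one.
        Charset gb = Charset.forName("GB2312");
        byte[] two = "\u4f60\u597d".getBytes(gb);
        byte[] cut = Arrays.copyOf(two, 3);
        String damaged = new String(cut, gb);
        System.out.println(damaged.equals("\u4f60\u597d")); // false
    }
}
```

Cutting by char index on the decoded String avoids splitting a double-byte character in half.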

IV. Reader, Writer / InputStream, OutputStream

The core of Reader and Writer is char, and the core of InputStream and OutputStream is byte. The main purpose of Reader and Writer, however, is to read chars from an InputStream and write chars to an OutputStream. For example:

The file text.txt contains only the single character "你" ("you"), stored as 0xc4, 0xe3.

String encoding = "gb2312";
InputStreamReader reader = new InputStreamReader(
        new FileInputStream("text.txt"), encoding);
char[] c = new char[10];
int length = reader.read(c);
for (int i = 0; i < length; i++) {
    System.out.println(c[i]);
}
reader.close();

What is the result? "你". If encoding = "8859_1", what is the result? Two characters, shown as "??", meaning the text was not recognized. The reverse direction is left as an exercise.
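
The same experiment can be run without creating a file by feeding the two bytes through a ByteArrayInputStream. This is a sketch with an illustrative class name; the in-memory stream stands in for text.txt, and GB2312 is assumed to be installed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;

public class ReaderDemo {
    // Read all chars from the byte data, decoding with the named charset.
    static String readAll(byte[] data, String charsetName) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charsetName)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[10];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] fileContent = {(byte) 0xc4, (byte) 0xe3};
        System.out.println(readAll(fileContent, "gb2312").length());  // 1
        System.out.println(readAll(fileContent, "8859_1").length());  // 2
    }
}
```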

V. Understanding the Java compiler:

javac -encoding

We often leave out the -encoding parameter, but it matters a great deal for cross-platform work. If no encoding is specified, the system default is used: gb2312 on a GB platform and ISO8859_1 on an English platform. The Java compiler actually calls the sun.tools.javac.Main class to compile a file. The compile function of this class has an encoding variable, and the -encoding parameter is passed straight to it. The compiler reads the Java source file according to this variable and then compiles it into a class file in UTF-8 format. Sample code:

String str = "你";
FileWriter writer = new FileWriter("text.txt");
writer.write(str);
writer.close();

If the source is compiled as gb2312, you will find the bytes E4 BD A0 in the class file.

If it is compiled as 8859_1, the compiler sees the two characters 00c4 00e3, which in binary are:

0000 0000 1100 0100, 0000 0000 1110 0011

Because each value needs more than 7 bits, the 11-bit encoding is used for each:

C3 -- 84 -- C3 -- A3

You will find C3 84 C3 A3 in the class file.
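
The effect of -encoding can be imitated in a few lines: decode the source bytes of the literal with the assumed source encoding, then encode the result as UTF-8. This is only a simulation of the idea with made-up names, not what javac actually runs. Class files use a modified UTF-8, which gives the same bytes as standard UTF-8 for these characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CompileEncodingDemo {
    // Bytes that would be stored for a string literal whose source bytes are read
    // with the given source encoding, shown as upper-case hex.
    static String storedBytes(byte[] sourceBytes, String sourceEncoding) {
        String literal = new String(sourceBytes, Charset.forName(sourceEncoding));
        byte[] utf8 = literal.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] inSourceFile = {(byte) 0xc4, (byte) 0xe3}; // the literal as saved in GB2312
        System.out.println(storedBytes(inSourceFile, "gb2312")); // E4 BD A0
        System.out.println(storedBytes(inSourceFile, "8859_1")); // C3 84 C3 A3
    }
}
```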

However, we often ignore this parameter, which leads to cross-platform problems:

Compiling the sample code on a Chinese platform produces ZhClass.

Compiling the sample code on an English platform produces EnClass.

(1) ZhClass runs correctly on the Chinese platform, but not on the English platform.

(2) EnClass runs correctly on the English platform, but not on the Chinese platform.

The reasons are:

(1) When compiled on the Chinese platform, str at run time is the char[] {0x4f60}. On the Chinese platform the default encoding of FileWriter is gb2312, so CharToByteConverter automatically uses the gb2312 converter to turn str into bytes and pass them to FileOutputStream, and 0xc4, 0xe3 are written to the file. On the English platform the default CharToByteConverter is 8859_1, so FileWriter converts str with 8859_1, which cannot represent the character, and "?" is written instead.

(2) When compiled on the English platform, str at run time is the char[] {0x00c4, 0x00e3}. On the Chinese platform these characters cannot be represented in the Chinese encoding, so "??" is written. On the English platform 0x00c4 --> 0xc4 and 0x00e3 --> 0xe3, so 0xc4, 0xe3 are written to the file.
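
One way to take the platform default out of the picture is to name the charset when creating the writer, using OutputStreamWriter instead of FileWriter. The sketch below is an illustration under that assumption: the class name and the temporary file are made up, and GB2312 must be installed.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetWriter {
    // Write text to a temporary file with an explicit charset and return the raw bytes written.
    static byte[] writeAndReadBack(String text, String charsetName) {
        try {
            Path file = Files.createTempFile("text", ".txt");
            try (Writer writer = new OutputStreamWriter(
                    new FileOutputStream(file.toFile()), Charset.forName(charsetName))) {
                writer.write(text);
            }
            byte[] bytes = Files.readAllBytes(file);
            Files.delete(file);
            return bytes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = writeAndReadBack("\u4f60", "gb2312");
        for (byte b : bytes) {
            System.out.println(Integer.toHexString(b & 0xff)); // c4 then e3, on any platform
        }
    }
}
```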

VI. Other causes:

The display encoding set for the browser also matters. If it does not match and the response data is UTF-8 encoded, the page is displayed as garbled text, although the garbling looks different from the cases above.

VII. Where encoding conversion happens:

1. From database to Java program: byte --> char
2. From Java program to database: char --> byte
3. From file to Java program: byte --> char
4. From Java program to file: char --> byte
5. From Java program to page display: char --> byte
6. From form data submitted on a page to Java program: byte --> char
7. From stream to Java program: byte --> char
8. From Java program to stream: char --> byte

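You can configure a filter to solve garbled Chinese text:

<filter>
    <filter-name>RequestFilter</filter-name>
    <filter-class>net.golden.uirs.util.RequestFilter</filter-class>
    <init-param>
        <param-name>charset</param-name>
        <param-value>gb2312</param-value>
    </init-param>
</filter>
<filter-mapping>
    <filter-name>RequestFilter</filter-name>
    <url-pattern>*.jsp</url-pattern>
</filter-mapping>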

public void doFilter(ServletRequest req, ServletResponse res,
        FilterChain fchain) throws IOException, ServletException {
    HttpServletRequest request = (HttpServletRequest) req;
    HttpServletResponse response = (HttpServletResponse) res;
    HttpSession session = request.getSession();
    String userid = (String) session.getAttribute("userid");
    // Set the character set. In effect this sets the encoding used for byte --> char.
    req.setCharacterEncoding(this.filterConfig.getInitParameter("charset"));
    try {
        if (userid == null || userid.equals("")) {
            if (!request.getRequestURL().toString().matches(
                    ".*/uirs/logon(Controller){0,1}\\x2ejsp$")) {
                session.invalidate();
                response.sendRedirect(request.getContextPath() +
                        "/uirs/logon.jsp");
                return; // do not continue the chain after a redirect
            }
        } else {
            // Check whether the user has permission to use the information reporting system.
            if (!net.golden.uirs.util.UirsChecker.check(userid, "information reporting system",
                    net.golden.uirs.util.UirsChecker.ACTION_DO)) {
                if (!request.getRequestURL().toString().matches(
                        ".*/uirs/logon(Controller){0,1}\\x2ejsp$")) {
                    response.sendRedirect(request.getContextPath() +
                            "/uirs/logon/logoncontroller.jsp");
                    return;
                }
            }
        }
    } catch (Exception ex) {
        response.sendRedirect(request.getContextPath() +
                "/uirs/logon.jsp");
        return;
    }
    fchain.doFilter(req, res);
}
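
What the request encoding changes can be seen without a servlet container. Form data arrives as percent-encoded bytes, and the charset decides how those bytes become chars. The sketch below uses java.net.URLDecoder as a stand-in for the container's parameter parsing, which is an assumption made for illustration; the class name is made up as well.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class FormDecoding {
    // Decode a percent-encoded form value using the named charset.
    static String decodeParam(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "%C4%E3"; // the bytes 0xc4 0xe3 as a browser would submit them
        System.out.println(decodeParam(raw, "gb2312").length());      // 1
        System.out.println(decodeParam(raw, "ISO-8859-1").length());  // 2
    }
}
```

With the right charset the value decodes to the single character U+4F60. With ISO-8859-1 it becomes two unrelated Latin-1 characters, which is the garbling the filter is meant to prevent.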
