Handling garbled Chinese text in Java: lessons from experience

Last Update:2018-12-05 Source: Internet

Author: User

Tags urlencode


Why is garbled text an unavoidable topic for Chinese programmers? The answer starts with the encoding mechanism: Chinese and English text are encoded differently, so they must be decoded differently too. About the only way a programmer in China could avoid garbled characters would be to program entirely in Chinese. I am not very clear on how Chinese-language programming turned out. The year before last, I think, a friend introduced it to me and told me how good it was, but I was busy with my studies and paid no attention. By the time I had free time, that friend had dropped it and could no longer explain it, and in the end I never learned it myself.
This article is not about explaining the differences between Chinese and English encoding and decoding. Instead, it collects the fixes for garbled text that I have come across in my work over the past few years. I hope you will also share whatever methods you have found, so that together we can compile a "Sunflower Manual" for solving garbled text.

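Java handles text internally as Unicode, so Chinese text becomes garbled whenever bytes in another encoding are decoded with the wrong character set. The common fix is to convert the string back to its original bytes and decode them again with the correct character set:
String s2 = new String(s1.getBytes("ISO-8859-1"), "GBK");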
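
To see why this re-decoding works, the following minimal sketch simulates the whole failure in plain Java: GBK bytes are decoded as ISO-8859-1 by mistake, and the same one-line conversion recovers the text. The class and method names (RecodeDemo, recode) are invented for illustration, and the sketch assumes the JDK in use ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RecodeDemo {
    // Undo a wrong decode: turn the string back into the bytes it came from,
    // then decode those bytes with the charset they were really written in.
    static String recode(String garbled, Charset wronglyUsed, Charset actual) {
        return new String(garbled.getBytes(wronglyUsed), actual);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available
        String original = "\u4e2d\u6587";      // two Chinese characters, written as escapes

        // Simulate the bug: GBK bytes decoded as ISO-8859-1.
        String garbled = new String(original.getBytes(gbk), StandardCharsets.ISO_8859_1);

        String repaired = recode(garbled, StandardCharsets.ISO_8859_1, gbk);
        System.out.println(garbled.equals(original));  // false
        System.out.println(repaired.equals(original)); // true
    }
}
```

The repair is lossless only because ISO-8859-1 maps every byte value to exactly one character, so the wrong decode throws nothing away. If the wrong decode had used a charset that replaces invalid bytes, the original bytes could not be recovered this way.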

1. Using UTF-8 to fix garbled Chinese in JSP
Generally, add the following at the beginning of each page:

<%@ page language="java" contentType="text/html; charset=UTF-8"
    pageEncoding="UTF-8" %>

<%
request.setCharacterEncoding("UTF-8");
%>

charset=UTF-8 specifies that the JSP output is sent to the client encoded as UTF-8.

pageEncoding="UTF-8" lets the JSP engine correctly decode JSP source files that contain Chinese characters; this is particularly effective on Linux.

request.setCharacterEncoding("UTF-8"); sets the encoding used to decode the request.

Sometimes this still does not solve the problem, and the parameter has to be re-decoded by hand:

String msg = request.getParameter("message");
String str = new String(msg.getBytes("ISO-8859-1"), "UTF-8");
out.println(str);

2. Garbled Chinese in Tomcat 5.5

1) Copy the file %tomcat installation directory%/webapps/servlets-examples/WEB-INF/classes/filters/SetCharacterEncodingFilter.class to the filters directory under your webapp directory. If there is no filters directory, create one.
2) Add the following lines to your web.xml:
<filter>
    <filter-name>Set Character Encoding</filter-name>
    <filter-class>filters.SetCharacterEncodingFilter</filter-class>
    <init-param>
        <param-name>encoding</param-name>
        <param-value>GBK</param-value>
    </init-param>
</filter>
<filter-mapping>
    <filter-name>Set Character Encoding</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>

3) Done. The filter sets the encoding used to decode the request body, so it covers parameters submitted with POST.
2. GET solution
1) Open Tomcat's server.xml file, locate the <Connector> block, and add the following attribute:
URIEncoding="GBK"
The complete block should look like this:
<Connector
    port="80" maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
    enableLookups="false" redirectPort="8443" acceptCount="100"
    debug="0" connectionTimeout="20000"
    disableUploadTimeout="true"
    URIEncoding="GBK"
/>

2) Restart Tomcat. Everything is OK.

3. Chinese in XMLHttpRequest

The JSP page uses GBK encoding:

<%@ page contentType="text/html; charset=GBK" %>

JavaScript section:

function addfracasreport() {
    var url = "controler?actionid=0_06_03_01&actionflag=0010";
    var urlmsg = "&reportid=" + fracasreport1.textreportid.value; // fault report number

    var xmlhttp = Common.createXMLHttpRequest();
    xmlhttp.onreadystatechange = Common.getReadyStateHandler(xmlhttp, eval("turnanalypage"));
    xmlhttp.open("POST", url, true);
    xmlhttp.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
    xmlhttp.send(urlmsg);
}

The reportid received by the Java back end is garbled, and I did not know how to convert it, mainly because I did not know which encoding xmlhttp.send(urlmsg) uses for the data. I then tried several conversions in Java, none of which worked, including:

public static String utf_8togbk(String str) {
    try {
        return new String(str.getBytes("UTF-8"), "GBK");
    } catch (Exception ex) {
        return null;
    }
}

public static String utf8togbk(String str) {
    try {
        return new String(str.getBytes("UTF-16BE"), "GBK");
    } catch (Exception ex) {
        return null;
    }
}

public static String GBK(String str) {
    try {
        return new String(str.getBytes("GBK"), "GBK");
    } catch (Exception ex) {
        return null;
    }
}

public static String getstr(String str) {
    try {
        String temp_p = str;
        String temp = new String(temp_p.getBytes("ISO8859_1"), "GBK");
        temp = sqlstrchop(temp);
        return temp;
    } catch (Exception e) {
        return null;
    }
}
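
The text leaves open which encoding send() uses. One plausible reading, stated here as an assumption and not as the author's finding, is that the browser transmits a string body as UTF-8 while the servlet container decodes it as ISO-8859-1. Under that assumption, the sketch below shows why the first attempt above cannot work and which conversion would reverse the damage. The class and method names are hypothetical, and the sketch assumes the GBK charset is available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class XhrDecodeDemo {
    // Assumed failure mode: UTF-8 bytes decoded as ISO-8859-1 on the server.
    static String asSeenByServlet(String sent) {
        return new String(sent.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // The first attempt from the text: encode as UTF-8, decode as GBK.
    static String failedAttempt(String received) {
        return new String(received.getBytes(StandardCharsets.UTF_8), Charset.forName("GBK"));
    }

    // Reverse the assumed wrong decode step exactly.
    static String repair(String received) {
        return new String(received.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sent = "\u6545\u969c"; // a two-character Chinese sample value
        String received = asSeenByServlet(sent);
        System.out.println(failedAttempt(received).equals(sent)); // false
        System.out.println(repair(received).equals(sent));        // true
    }
}
```

The failed attempt re-encodes the already garbled characters instead of recovering the bytes that were actually sent, so it compounds the error. If the assumption about UTF-8 does not hold for a given browser or container, the charset passed to repair would have to change accordingly.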

4. A JDBC-ODBC bridge bug and its solution

While writing a database management program, I found a hard-to-spot bug in the JDBC-ODBC bridge. When data is inserted into a table, English text is stored completely correctly. With Chinese text, however, some databases store only the first seven or eight Chinese characters and truncate the rest, so the stored content is incomplete. (Some databases, such as Sybase SQL Anywhere 5.0, do not have this problem. The JDBC-ODBC bridge also has a bug where a table cannot be created.)

This is bad news for Java programmers who need to store Chinese information: either use another programming language or choose another, more expensive database product. The goal of "write once, run anywhere" is also compromised. Can the problem be worked around by transforming the Chinese information before storing it? The answer is yes.

Solution
Java uses Unicode and stores both Chinese and English characters in 16 bits. Since English information is stored correctly, Chinese information that has been converted into English (ASCII) information by a fixed rule will not be truncated either. When the information is read back, the reverse operation restores the English information to Chinese. Under the GB2312 encoding rules, a Chinese character is generally two bytes, each with its high bit set to 1. The conversion clears the high bit of both bytes of each Chinese character, and the restoration sets the high bit again. To handle Chinese strings that also contain English characters, each English character must be prefixed with a marker byte of 0. The following two public static methods can be added to any class.

Convert a mixed Chinese and English string into an English-only (ASCII) string:
public static String toTureAsciiStr(String str) {
    StringBuffer sb = new StringBuffer();
    byte[] bt = str.getBytes();
    for (int i = 0; i < bt.length; i++) {
        if (bt[i] < 0) {
            // Byte of a Chinese character: clear the high bit
            sb.append((char) (bt[i] & 0x7f));
        } else { // English character: prefix it with a 0 marker
            sb.append((char) 0);
            sb.append((char) bt[i]);
        }
    }
    return sb.toString();
}

Restore a converted string:
public static String unToTrueAsciiStr(String str) {
    byte[] bt = str.getBytes();
    int i, l = 0, length = bt.length, j = 0;
    // Count the 0 markers to size the output array
    for (i = 0; i < length; i++) {
        if (bt[i] == 0) {
            l++;
        }
    }
    byte[] bt2 = new byte[length - l];
    for (i = 0; i < length; i++) {
        if (bt[i] == 0) {
            // Marker: copy the following English byte unchanged
            i++;
            bt2[j] = bt[i];
        } else {
            // Byte of a Chinese character: set the high bit again
            bt2[j] = (byte) (bt[i] | 0x80);
        }
        j++;
    }
    String tt = new String(bt2);
    return tt;
}
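
The two methods above call getBytes() and new String(byte[]) without naming a charset, so they depend on the platform default being GB2312. As a hedged variant, the sketch below applies the same high-bit scheme with the charsets pinned explicitly and checks one round trip. The class and method names (AsciiSafeDemo, encode, decode) are invented for this example, and it assumes the JDK provides the GB2312 charset.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class AsciiSafeDemo {
    static final Charset GB2312 = Charset.forName("GB2312"); // assumption: available

    // Clear the high bit of Chinese bytes; prefix plain ASCII bytes with a 0 marker.
    static String encode(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(GB2312)) {
            if (b < 0) {
                sb.append((char) (b & 0x7f));
            } else {
                sb.append((char) 0);
                sb.append((char) b);
            }
        }
        return sb.toString();
    }

    // Reverse of encode: a 0 marker means "copy the next byte", anything else
    // gets its high bit set again before decoding as GB2312.
    static String decode(String stored) {
        byte[] in = stored.getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            if (in[i] == 0 && i + 1 < in.length) {
                out.write(in[++i]);
            } else {
                out.write(in[i] | 0x80);
            }
        }
        return new String(out.toByteArray(), GB2312);
    }

    public static void main(String[] args) {
        String mixed = "A\u4e2dB"; // ASCII, one Chinese character, ASCII
        String stored = encode(mixed);
        System.out.println(stored.length());              // 6
        System.out.println(decode(stored).equals(mixed)); // true
    }
}
```

GB2312 is pinned here rather than GBK because every byte of a GB2312 Chinese character has its high bit set, which is what the scheme relies on.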

The approach above works well in practice, but Chinese information stored this way must be decoded in the same way before other systems can use it. In addition, every English character in a string takes up extra storage space because of its marker.

5. Chinese problems and solutions for servlet programming on Solaris
While developing an Internet application in Java, I found that a servlet that worked perfectly when debugged on Windows misbehaved once it was uploaded to a Solaris server. First, the returned web pages could not display Chinese: all Chinese text was garbled. Second, searching the database with Chinese keywords did not return the correct results. By adding diagnostic code and other means, I tracked down the causes:

The garbled display arises mainly because the setContentType method of HttpServletResponse does not actually change the encoding of the data returned to the client. The correct encoding would be GB2312 or GBK, but the default ISO8859-1 is used instead. The failed searches arise because the servlet cannot correctly decode the Chinese information that the browser encoded and submitted to the server.

Fixing the garbled display, by example
A servlet is typically written as follows:

public class ZldTestServlet extends HttpServlet {
    public void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
        // Before using the Writer to return data to the browser, set the Content-Type header with the character set gb2312
        response.setContentType("text/html; charset=gb2312");
        PrintWriter out = response.getWriter(); //*
        // Return the actual data
        out.println("<html><body>");
        out.println("this is a test page!");
        out.println("</body></html>");
        out.close();
    }
    ...
}

To fix the garbled page, replace the line marked * with the following:

PrintWriter out = new PrintWriter(new OutputStreamWriter(response.getOutputStream(), "gb2312"));
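
The effect of wrapping the output stream in an OutputStreamWriter with an explicit charset can be checked without a servlet container. In the sketch below, a ByteArrayOutputStream stands in for response.getOutputStream(); the class and method names are hypothetical, and the GB2312 charset is assumed to be available in the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;

class WriterCharsetDemo {
    // Write text through a PrintWriter that encodes with the named charset
    // and return the bytes that would go out on the wire.
    static byte[] render(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for the response stream
        PrintWriter out = new PrintWriter(
                new OutputStreamWriter(sink, Charset.forName(charsetName)));
        out.print(text);
        out.close(); // flushes the encoder into the sink
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters
        System.out.println(render(text, "GB2312").length);     // 4: two bytes per character
        System.out.println(render(text, "ISO-8859-1").length); // 2: each character became '?'
    }
}
```

With ISO-8859-1 the Chinese characters cannot be represented at all and are replaced, which matches the symptom described above when the default encoding is used.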

Fixing Chinese information retrieval on Solaris
When a browser submits a form to the server, it generally encodes the data in the application/x-www-form-urlencoded MIME format. With the GET method, the parameter names and values are encoded and appended to the URL; Java calls this part the query string.

In a servlet, if a parameter value is obtained with the ServletRequest method getParameter, Chinese characters are not decoded correctly in the Solaris environment, so the database search fails.

The java.net package in Java 1.2 provides the URLEncoder and URLDecoder classes. URLEncoder provides a method that converts a given string to application/x-www-form-urlencoded format, and URLDecoder provides the inverse method.
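
As a minimal sketch of how these two classes behave, the example below encodes a Chinese keyword for a query string and decodes it again. The wrapper class and its method names are assumptions made for illustration; the wrappers only turn the checked UnsupportedEncodingException into an unchecked one. GBK is used as the example charset and is assumed to be available.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class QueryStringDemo {
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    static String decode(String value, String charsetName) {
        try {
            return URLDecoder.decode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String keyword = "\u4e2d\u6587"; // two Chinese characters
        String wire = encode(keyword, "GBK");
        System.out.println(wire);                                   // %D6%D0%CE%C4
        System.out.println(decode(wire, "GBK").equals(keyword));    // true
        System.out.println(decode(wire, "UTF-8").equals(keyword));  // false
    }
}
```

The last line shows the failure mode behind the broken searches: the percent-encoded bytes are only meaningful when both sides agree on the charset.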

6. Garbled mail with Commons Mail
Commons Mail is a small, convenient mail package that wraps JavaMail and is very easy to use. However, I found that sending plain-text content with it produces garbled text. The code is as follows:
public class TestCommonMail {
    public static void main(String[] args) throws EmailException, MessagingException {
        SimpleEmail email = new SimpleEmail();
        email.setCharset("gb2312");
        email.setHostName("smtp.163.com");
        email.setSubject("test");
        email.addTo("test@163.com");
        email.setFrom("test@163.com");
        email.setMsg("My tests");
        email.setAuthentication("test", "test");
        email.send();
    }
}

Analyzing the Commons Mail source code reveals the cause. The source is as follows:
public class SimpleEmail extends Email
{
    public Email setMsg(String msg) throws EmailException, MessagingException
    {
        if (EmailUtils.isEmpty(msg))
        {
            throw new EmailException("Invalid message supplied");
        }

        setContent(msg, TEXT_PLAIN);
        return this;
    }
}

Code snippet from the Email class:
public void setContent(Object aObject, String aContentType)
{
    this.content = aObject;
    if (EmailUtils.isEmpty(aContentType))
    {
        this.contentType = null;
    }
    else
    {
        // Set the content type
        this.contentType = aContentType;

        // Set the charset if the input was properly formed
        String strMarker = "; charset=";
        int charsetPos = aContentType.toLowerCase().indexOf(strMarker);
        if (charsetPos != -1)
        {
            // Find the next space (after the marker)
            charsetPos += strMarker.length();
            int intCharsetEnd =
                aContentType.toLowerCase().indexOf(" ", charsetPos);

            if (intCharsetEnd != -1)
            {
                this.charset =
                    aContentType.substring(charsetPos, intCharsetEnd);
            }
            else
            {
                this.charset = aContentType.substring(charsetPos);
            }
        }
    }
}

Calling email.send() leads to the buildMimeMessage method:
public void buildMimeMessage() throws EmailException
{
    try
    {
        this.getMailSession();
        this.message = new MimeMessage(this.session);

        if (EmailUtils.isNotEmpty(this.subject))
        {
            if (EmailUtils.isNotEmpty(this.charset))
            {
                this.message.setSubject(this.subject, this.charset);
            }
            else
            {
                this.message.setSubject(this.subject);
            }
        }

        // ========================================================
        // Start of replacement code
        if (this.content != null)
        {
            this.message.setContent(this.content, this.contentType);
        }
        // End of replacement code
        // ========================================================
        else if (this.emailBody != null)
        {
            this.message.setContent(this.emailBody);
        }
        else
        {
            this.message.setContent("", Email.TEXT_PLAIN);
        }

        if (this.fromAddress != null)
        {
            this.message.setFrom(this.fromAddress);
        }
        else
        {
            throw new EmailException("sender address required");
        }

        if (this.toList.size() + this.ccList.size() + this.bccList.size() == 0)
        {
            throw new EmailException(
                "At least one receiver address required");
        }

        if (this.toList.size() > 0)
        {
            this.message.setRecipients(
                Message.RecipientType.TO,
                this.toInternetAddressArray(this.toList));
        }

        if (this.ccList.size() > 0)
        {
            this.message.setRecipients(
                Message.RecipientType.CC,
                this.toInternetAddressArray(this.ccList));
        }

        if (this.bccList.size() > 0)
        {
            this.message.setRecipients(
                Message.RecipientType.BCC,
                this.toInternetAddressArray(this.bccList));
        }

        if (this.replyList.size() > 0)
        {
            this.message.setReplyTo(
                this.toInternetAddressArray(this.replyList));
        }

        if (this.headers.size() > 0)
        {
            Iterator iterHeaderKeys = this.headers.keySet().iterator();
            while (iterHeaderKeys.hasNext())
            {
                String name = (String) iterHeaderKeys.next();
                String value = (String) headers.get(name);
                this.message.addHeader(name, value);
            }
        }

        if (this.message.getSentDate() == null)
        {
            this.message.setSentDate(getSentDate());
        }

        if (this.popBeforeSmtp)
        {
            Store store = session.getStore("pop3");
            store.connect(this.popHost, this.popUsername, this.popPassword);
        }
    }
    catch (MessagingException me)
    {
        throw new EmailException(me);
    }
}
The code shows that the message is finally handed to JavaMail through
message.setContent(this.content, this.contentType);
where content is the message body
and contentType is its type, such as text/plain.
(You can verify this by sending an email directly with JavaMail: set the text with setContent("test", "text/plain") instead of setText, and the content arrives garbled.)
The key is text/plain. Changing it to text/plain; charset=gb2312 fixes the problem. In Commons Mail, the setMsg method of the SimpleEmail class calls setContent(msg, TEXT_PLAIN), so we only need to modify the TEXT_PLAIN constant in the Email class to append charset=<your character set> and then rebuild the jar.
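
The difference between the two content types can be seen by running the charset-extraction step in isolation. The sketch below mirrors the parsing logic quoted from setContent in a standalone helper; it is not the library's API, and the class and method names (ContentTypeCharsetDemo, charsetOf) are invented for illustration.

```java
class ContentTypeCharsetDemo {
    // Returns the charset named after a "; charset=" marker in the content
    // type, or null when the marker is absent.
    static String charsetOf(String contentType) {
        String marker = "; charset=";
        int pos = contentType.toLowerCase().indexOf(marker);
        if (pos == -1) {
            return null;
        }
        pos += marker.length();
        int end = contentType.indexOf(' ', pos); // charset ends at the next space, if any
        return end == -1 ? contentType.substring(pos) : contentType.substring(pos, end);
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/plain"));                 // null
        System.out.println(charsetOf("text/plain; charset=gb2312")); // gb2312
    }
}
```

With the bare text/plain value no charset travels with the content type, which is consistent with the garbled result described above; once the constant carries charset=gb2312, the charset is part of the content type passed to JavaMail.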

7. Toad character set settings and Oracle installation
An Oracle database server is usually installed with a Chinese character set, but on some platforms it is installed with an ISO encoding. Toad is the best tool for Oracle development, and I am not the only one who says so. However, when Toad is installed in a Chinese environment and opens an Oracle database that uses an English character set, Chinese characters are garbled. To fix this:

Under Environment Variables > System variables, add
NLS_LANG=SIMPLIFIED CHINESE_CHINA.ZHS16GBK
or
NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1
or
NLS_LANG=AMERICAN_AMERICA.WE8MSWIN1252

Or

Open the registry and click HKEY_LOCAL_MACHINE.
Click SOFTWARE, and then ORACLE.
Click HOME (the Oracle directory).
NLS_LANG appears in the right-hand pane of the registry editor.
Double-click it and overwrite the original value.
It is best to write down the old value so that it can be changed back.

connect sys/change_on_install
update props$
set value$ = 'ZHS16CGB231280'
where name = 'NLS_CHARACTERSET';
commit;
That should do it.

8. How to solve Chinese problems in GWT (Google Web Toolkit)
A fix for garbled Chinese in GWT

1. Put the Chinese text you want to display, here 测试字符串 ("test string"), into a file such as 1.txt.
2. Open a command line, change to the directory containing 1.txt, type the command native2ascii.exe 1.txt 2.txt and press Enter. This generates another file, 2.txt.
3. The content of 2.txt is: \u6d4b\u8bd5\u5b57\u7b26\u4e32
4. Use this escaped form in your GWT code.
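
For readers without the native2ascii tool at hand, the conversion in step 2 can be approximated in a few lines of Java. This is a rough sketch of the escaping idea, not a reimplementation of the tool; the class and method names are hypothetical.

```java
class UnicodeEscapeDemo {
    // Keep ASCII characters as they are and replace every other UTF-16 code
    // unit with a backslash, the letter u, and four lowercase hex digits.
    static String toEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The five characters from step 3, given here as escapes in the source.
        String text = "\u6d4b\u8bd5\u5b57\u7b26\u4e32";
        System.out.println(toEscapes(text));
    }
}
```

Because the escaped form contains only ASCII characters, it survives any source-file encoding, which is why it sidesteps the garbling problem.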

9. Why is a web page fetched with XMLHTTP garbled?
(1) On the server side, use WebRequest instead of XMLHTTP.
(2) Change

StreamReader sr = new StreamReader(stream);

For Simplified Chinese, change it to:

StreamReader sr = new StreamReader(stream, Encoding.Default);
For UTF-8:

StreamReader sr = new StreamReader(stream, Encoding.UTF8);
Of course, the Encoding class has many other members that can be used for content in other encodings.

(3) Later I found that, whether the Content-Type is gb2312 or UTF-8,

StreamReader sr = new StreamReader(stream, Encoding.Default);

returns normal Chinese characters, so I switched to Encoding.Default everywhere.

---------------------------

Finally, the code for fetching the source of a web page from a URL on the server is as follows:

/// <summary>
/// Requests a specified URL to obtain the source code of the web page (implemented using WebRequest)
/// </summary>
/// <param name="url"></param>
/// <returns>
/// If the request fails, null is returned.
/// If the request is successful, the source code of the web page is returned
/// </returns>
public static string GetContentFromUrl2(string url)
{
    // Variable definition
    string respstr;

    WebRequest myWebRequest = WebRequest.Create(url);
    // myWebRequest.PreAuthenticate = true;
    // NetworkCredential networkCredential = new NetworkCredential(username, password, domain);
    // myWebRequest.Credentials = networkCredential;

    // Assign the response object of 'WebRequest' to a 'WebResponse' variable.
    WebResponse myWebResponse = myWebRequest.GetResponse();
    System.IO.Stream stream = myWebResponse.GetResponseStream();
    StreamReader sr = new StreamReader(stream, Encoding.Default);
    // Read the data stream as a string
    respstr = sr.ReadToEnd();
    sr.Close();

    return respstr;
}

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
