Complete Guide to URL Encoding for Chinese Characters
I thought this would work:
public static String strToUrl(String str) {
    String result = null;
    try {
        result = URLEncoder.encode(str, "UTF-8"); // or "gb2312"
    } catch (UnsupportedEncodingException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return result;
}

public static String urlToStr(String url) {
    String result = null;
    try {
        result = URLDecoder.decode(url, "UTF-8");
        // Replace the space with the plus sign
        result = result.replaceAll(" ", "+");
    } catch (UnsupportedEncodingException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return result;
}
To my surprise, the Chinese characters still came out garbled. I captured the packets and found that the request on the wire was correct, so the problem had to be on the Tomcat side. Adding the URIEncoding attribute to the Connector element fixed it:
<Connector port="8236" protocol="HTTP/1.1" redirectPort="8443" URIEncoding="UTF-8"/>
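To see why the Connector setting matters, the sketch below reproduces the garbling outside Tomcat. It assumes the server decoded the request URI as ISO-8859-1 (the default in older Tomcat releases when URIEncoding is not set) while the client encoded it as UTF-8. The class name MojibakeDemo and the sample text are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class MojibakeDemo {
    // Encodes with one charset and decodes with another, the way a
    // mismatched client and server would.
    static String roundTrip(String text, String encodeCharset, String decodeCharset) {
        try {
            String encoded = URLEncoder.encode(text, encodeCharset);
            return URLDecoder.decode(encoded, decodeCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset", e);
        }
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so this file stays ASCII.
        String text = "\u4e2d\u6587";
        // Same charset on both sides: the text survives (prints true).
        System.out.println(text.equals(roundTrip(text, "UTF-8", "UTF-8")));
        // UTF-8 bytes read as ISO-8859-1: each of the 6 bytes becomes
        // its own character, so the result is garbled (prints 6).
        System.out.println(roundTrip(text, "UTF-8", "ISO-8859-1").length());
    }
}
```

If the second call returned the original two characters, the charsets would match; a length of 6 is the signature of UTF-8 bytes being read one byte per character.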
/*
When a form in a web page is submitted with the POST method, the content type of the data is application/x-www-form-urlencoded. With this type:
1. The characters "a"-"z", "A"-"Z", "0"-"9", ".", "-", "*", and "_" are not encoded;
2. Spaces are converted to plus signs (+);
3. Every other character is converted to "%XY", where XY is a two-digit hexadecimal value;
4. An & symbol is placed between each name=value pair.
*/
The URLEncoder class contains a static method that converts a string to the application/x-www-form-urlencoded MIME format.
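The first three rules can be checked directly against URLEncoder. The following is a minimal sketch; the class name FormRulesDemo, the helper enc(), and the sample strings are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class FormRulesDemo {
    // Wraps the checked exception so callers can use the helper inline.
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    public static void main(String[] args) {
        // Rule 1: letters, digits, '.', '-', '*' and '_' pass through unchanged.
        System.out.println(enc("abcXYZ019.-*_"));
        // Rule 2: a space becomes a plus sign (prints a+b).
        System.out.println(enc("a b"));
        // Rule 3: anything else becomes %XY per byte; U+4E2D is three
        // bytes in UTF-8 (prints %E4%B8%AD).
        System.out.println(enc("\u4e2d"));
    }
}
```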
One of the challenges web designers face is handling the differences between operating systems. These differences cause problems with URLs: for example, some operating systems allow spaces in file names and some do not. Most operating systems do not object to a "#" in a file name, but in a URL a "#" indicates that the file name has ended and a fragment identifier follows. Other special characters, non-alphanumeric character sets, and so on, which may have a special meaning inside a URL or on another operating system, present similar problems. To solve these problems, the characters used in a URL must come from a fixed subset of ASCII, as shown below:
1. Capital letters A-Z
2. Lowercase letters a-z
3. Digits 0-9
4. Punctuation characters - _ . ! ~ * ' ( ) ,
Characters such as : / & ? @ # ; $ + = and % can also be used, but only for their specific purposes. If a file name contains any of these characters, they should be encoded, as should all characters outside the set above.
The encoding process is very simple. Any character that is not an ASCII digit, letter, or one of the punctuation characters listed above is converted into bytes, and each byte is written as a "%" followed by two hexadecimal digits. The space is a special case because it is so common: besides being encoded as "%20", it can also be encoded as "+". The plus sign (+) itself is encoded as %2B. The characters / # = & and ? should be encoded when they are used as part of a name rather than as separators between parts of the URL.
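The process described above can be sketched by hand. This is an illustrative encoder, not the one the JDK uses: it assumes UTF-8 for the character-to-byte step and treats the punctuation list from the text as safe. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class PercentEncoderSketch {
    // Punctuation treated as safe, following the list in the text.
    private static final String SAFE_PUNCTUATION = "-_.!~*'(),";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    static boolean isSafe(int b) {
        return (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z')
                || (b >= '0' && b <= '9') || SAFE_PUNCTUATION.indexOf(b) >= 0;
    }

    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        // Assumption: characters are converted to bytes using UTF-8.
        for (byte raw : s.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xFF; // treat the byte as unsigned
            if (isSafe(b)) {
                out.append((char) b);
            } else if (b == ' ') {
                out.append('+'); // the special case for spaces
            } else {
                // '%' followed by two hexadecimal digits
                out.append('%').append(HEX[b >> 4]).append(HEX[b & 0x0F]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("a b"));    // a+b
        System.out.println(encode("+"));      // %2B
        System.out.println(encode("x/y#z"));  // x%2Fy%23z
        System.out.println(encode("\u00e9")); // %C3%A9
    }
}
```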
Warning: this scheme does not work well in heterogeneous environments with many character sets. For example, on U.S. Windows, é is encoded as %E9; on a U.S. Mac, it is encoded as %8E. This ambiguity is an obvious deficiency of the existing URI specification, and it should be addressed in the future through Internationalized Resource Identifiers (IRIs).
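The same kind of mismatch can be reproduced on a single machine by naming the charset explicitly. The sketch below uses ISO-8859-1 and UTF-8 rather than the Windows and Mac charsets from the text, because every Java runtime is required to support those two; the class name is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class CharsetDependenceDemo {
    static String encode(String s, String charset) {
        try {
            return URLEncoder.encode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // the letter e with an acute accent
        // One byte in ISO-8859-1 (prints %E9).
        System.out.println(encode(eAcute, "ISO-8859-1"));
        // Two bytes in UTF-8 (prints %C3%A9).
        System.out.println(encode(eAcute, "UTF-8"));
    }
}
```

A receiver that does not know which charset the sender used cannot tell these two results apart from unrelated characters, which is the ambiguity the warning describes.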
The URL class does not encode or decode automatically. You can construct a URL object that contains illegal ASCII characters, non-ASCII characters, and/or %XX escapes. Such characters and escapes are not automatically encoded or decoded when output by methods such as getPath() and toExternalForm(). You are responsible for making sure that all characters are properly encoded in the string used to construct a URL object.
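A short check of this behavior, as a sketch: the host www.example.com and the path are placeholders, and the class name is hypothetical. Note that the URL(String) constructor is deprecated in recent JDKs (since Java 20), so compiling this may produce a deprecation warning.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlNoEncodingDemo {
    static URL parse(String spec) {
        try {
            return new URL(spec);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad URL: " + spec, e);
        }
    }

    public static void main(String[] args) {
        // The path contains a space and a non-ASCII character, both unencoded.
        URL u = parse("http://www.example.com/hello world/caf\u00e9");
        // Both methods return the characters exactly as they were supplied.
        System.out.println(u.getPath());
        System.out.println(u.toExternalForm());
    }
}
```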
Fortunately, Java provides a URLEncoder class that encodes strings in this form, and a URLDecoder class that decodes them. Neither class needs to be instantiated:
public class URLDecoder extends Object
public class URLEncoder extends Object
1. URLEncoder
In Java 1.3 and earlier, the class java.net.URLEncoder has a single static method, encode(), which encodes a string according to these rules:
public static String encode(String s)
This method always uses the default encoding of the platform it runs on, so it produces different results on different systems. As a result, Java 1.4 deprecates it in favor of another method that requires you to specify the encoding:
public static String encode(String s, String encoding) throws UnsupportedEncodingException
Both methods convert every non-alphanumeric character to %XX, except for the space, underscore (_), hyphen (-), period (.), and asterisk (*). Both also encode all non-ASCII characters. The space is converted to a plus sign. These methods are somewhat overzealous: they also convert "~", "'", and "()" to %XX even though that is not necessary. However, this conversion is not forbidden by the URL specification, so web browsers handle these over-encoded URLs without trouble.
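The over-encoding is easy to observe, and it can be undone after the fact if a shorter URL is wanted. The relax() helper below is a hypothetical post-processing step, not part of the JDK; it simply restores the four characters named above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class OverEncodingDemo {
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    // Hypothetical helper: restores the characters that URLEncoder
    // escapes even though the text says they do not need it.
    static String relax(String encoded) {
        return encoded.replace("%7E", "~")
                      .replace("%27", "'")
                      .replace("%28", "(")
                      .replace("%29", ")");
    }

    public static void main(String[] args) {
        String raw = "~'()";
        System.out.println(enc(raw));        // %7E%27%28%29
        System.out.println(relax(enc(raw))); // ~'()
    }
}
```

Because a literal percent sign is itself encoded as %25, an input such as "%7E" becomes "%257E" and is not touched by relax().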
Both methods return a new, encoded string. The Java 1.3 encode() method uses the platform's default encoding to compute the %XX escapes. That encoding is typically ISO-8859-1 on U.S. Unix systems, Cp1252 on U.S. Windows, MacRoman on U.S. Macs, and other character sets in other locales. Because both encoding and decoding depend on the local platform, this method is annoyingly non-portable.
That is exactly why it was deprecated in Java 1.4 in favor of the method that requires you to specify the encoding. However, if you simply pass in your platform's default encoding, your program will be just as platform-dependent as it was in Java 1.3. Instead, you should always use UTF-8 and nothing else. UTF-8 is compatible with modern web browsers and with more other software than any other encoding you could choose.
Example 7-8 uses URLEncoder.encode() to print various encoded strings. It must be compiled and run with Java 1.4 or later.
Example 7-8. x-www-form-urlencoded strings
import java.net.URLEncoder;
import java.net.URLDecoder;
import java.io.UnsupportedEncodingException;

public class EncoderTest {
    public static void main(String[] args) {
        try {
            System.out.println(URLEncoder.encode("This string has spaces", "UTF-8"));
            System.out.println(URLEncoder.encode("This*string*has*asterisks", "UTF-8"));
            System.out.println(URLEncoder.encode("This%string%has%percent%signs", "UTF-8"));
            System.out.println(URLEncoder.encode("This+string+has+pluses", "UTF-8"));
            System.out.println(URLEncoder.encode("This/string/has/slashes", "UTF-8"));
            System.out.println(URLEncoder.encode("This\"string\"has\"quote\"marks", "UTF-8"));
            System.out.println(URLEncoder.encode("This:string:has:colons", "UTF-8"));
            System.out.println(URLEncoder.encode("This~string~has~tildes", "UTF-8"));
            System.out.println(URLEncoder.encode("This(string)has(parentheses)", "UTF-8"));
            System.out.println(URLEncoder.encode("This.string.has.periods", "UTF-8"));
            System.out.println(URLEncoder.encode("This=string=has=equals=signs", "UTF-8"));
            System.out.println(URLEncoder.encode("This&string&has&ampersands", "UTF-8"));
            System.out.println(URLEncoder.encode("Thiséstringéhasénon-ASCII characters", "UTF-8"));
            // System.out.println(URLEncoder.encode("This People's Republic of China", "UTF-8"));
        } catch (UnsupportedEncodingException ex) {
            throw new RuntimeException("Broken VM does not support UTF-8");
        }
    }
}
Here is its output. Note that the source code must be saved in an encoding other than ASCII, and the encoding you choose must be passed to the compiler as an argument so that it can correctly interpret the non-ASCII characters in the source code.
% javac -encoding UTF8 EncoderTest.java
% java EncoderTest
This+string+has+spaces
This*string*has*asterisks
This%25string%25has%25percent%25signs
This%2Bstring%2Bhas%2Bpluses
This%2Fstring%2Fhas%2Fslashes
This%22string%22has%22quote%22marks
This%3Astring%3Ahas%3Acolons
This%7Estring%7Ehas%7Etildes
This%28string%29has%28parentheses%29
This.string.has.periods
This%3Dstring%3Dhas%3Dequals%3Dsigns
This%26string%26has%26ampersands
This%C3%A9string%C3%A9has%C3%A9non-ASCII+characters
Note in particular that this method encodes the forward slash (/), the ampersand (&), the equals sign (=), and the colon (:). It makes no attempt to determine how these characters are being used in a URL. Consequently, you have to encode your URL piece by piece rather than passing the whole URL to the method in a single call. This is an important point, because the most common use of URLEncoder is to prepare query strings for server-side programs that use the GET method. For example, suppose you want to encode this query string, which runs a search on the AltaVista website:
pg=q&kl=XX&stype=stext&q=+"Java+I/O"&search.x=38&search.y=3
This code fragment encodes it:
String query = URLEncoder.encode("pg=q&kl=XX&stype=stext&q=+\"Java+I/O\"&search.x=38&search.y=3");
System.out.println(query);
Unfortunately, the output is:
pg%3Dq%26kl%3DXX%26stype%3Dstext%26q%3D%2B%22Java%2BI%2FO%22%26search.x%3D38%26search.y%3D3
The problem is that URLEncoder.encode() encodes blindly. It cannot distinguish between special characters used as part of the URL or query string syntax (such as the "=" and "&" in the preceding string) and characters that really need to be encoded. Therefore, the URL has to be encoded one piece at a time, like this:
String query = URLEncoder.encode("pg");
query += "=";
query += URLEncoder.encode("q");
query += "&";
query += URLEncoder.encode("kl");
query += "=";
query += URLEncoder.encode("XX");
query += "&";
query += URLEncoder.encode("stype");
query += "=";
query += URLEncoder.encode("stext");
query += "&";
query += URLEncoder.encode("q");
query += "=";
query += URLEncoder.encode("+\"Java I/O\"");
query += "&";
query += URLEncoder.encode("search.x");
query += "=";
query += URLEncoder.encode("38");
query += "&";
query += URLEncoder.encode("search.y");
query += "=";
query += URLEncoder.encode("3");
System.out.println(query);
This is the output you really want:
pg=q&kl=XX&stype=stext&q=%2B%22Java+I%2FO%22&search.x=38&search.y=3
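The repeated calls can also be condensed into a loop over name-value pairs. The following is a rough sketch under the assumption that the parameters arrive in an ordered map; the class QueryBuilderSketch and its methods are hypothetical, and it uses the two-argument encode() with UTF-8 rather than the deprecated one-argument form.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryBuilderSketch {
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    // Encodes each name and value separately; the '=' and '&'
    // separators are appended unencoded.
    static String build(Map<String, String> params) {
        StringBuilder query = new StringBuilder();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (query.length() > 0) {
                query.append('&');
            }
            query.append(enc(entry.getKey())).append('=').append(enc(entry.getValue()));
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the pairs in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("pg", "q");
        params.put("kl", "XX");
        params.put("stype", "stext");
        params.put("q", "+\"Java I/O\"");
        params.put("search.x", "38");
        params.put("search.y", "3");
        System.out.println(build(params));
    }
}
```

A map cannot hold the same parameter name twice, so this shape only suits query strings with unique names.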
Example 7-9 is a QueryString class. It uses the URLEncoder class to encode successive name-value pairs in a Java object, which is then used to send data to a server-side program.
When you create a QueryString object, you pass the first name-value pair to the constructor to start the query string. To add another pair, call the add() method, which also takes two strings as arguments and encodes them. The getQuery() method returns the accumulated string of encoded name-value pairs.
Example 7-9. The QueryString class
package com.macfaq.net;

import java.net.URLEncoder;
import java.io.UnsupportedEncodingException;

public class QueryString {

    private StringBuffer query = new StringBuffer();

    public QueryString(String name, String value) {
        encode(name, value);
    }

    public synchronized void add(String name, String value) {
        query.append('&');
        encode(name, value);
    }

    // Appends one encoded name=value pair to the query.
    private synchronized void encode(String name, String value) {
        try {
            query.append(URLEncoder.encode(name, "UTF-8"));
            query.append('=');
            query.append(URLEncoder.encode(value, "UTF-8"));
        } catch (UnsupportedEncodingException ex) {
            throw new RuntimeException("Broken VM does not support UTF-8");
        }
    }

    public String getQuery() {
        return query.toString();
    }

    public String toString() {
        return getQuery();
    }
}
With this class, we can encode the query string from the previous example:
QueryString qs = new QueryString("pg", "q");
qs.add("kl", "XX");
qs.add("stype", "stext");
qs.add("q", "+\"Java I/O\"");
qs.add("search.x", "38");
qs.add("search.y", "3");
String url = "http://www.altavista.com/cgi-bin/query?" + qs;
System.out.println(url);
2. URLDecoder
The corresponding URLDecoder class has two static methods that decode strings encoded in the x-www-form-urlencoded format. That is, they convert every plus sign (+) to a space and every %XX escape to the corresponding character:
public static String decode(String s) throws Exception
public static String decode(String s, String encoding) // Java 1.4
    throws UnsupportedEncodingException
The first method is used in Java 1.2 and 1.3; the second is used in Java 1.4 and later. If you are not sure which encoding to use, choose UTF-8. It is more likely to give the correct result than any other encoding.
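Decoding a query string mirrors the piece-by-piece encoding shown earlier: split on the raw "&" and "=" separators first, then decode each piece. The sketch below assumes UTF-8 and unique parameter names (a repeated name overwrites the earlier value); the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryParserSketch {
    static String dec(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    static Map<String, String> parse(String query) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue; // skip empty segments such as a trailing '&'
            }
            int eq = pair.indexOf('=');
            // A pair without '=' is treated as a name with an empty value.
            String name = eq >= 0 ? pair.substring(0, eq) : pair;
            String value = eq >= 0 ? pair.substring(eq + 1) : "";
            result.put(dec(name), dec(value));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> params =
                parse("pg=q&kl=XX&q=%2B%22Java+I%2FO%22&search.x=38");
        System.out.println(params.get("q")); // +"Java I/O"
    }
}
```

Splitting before decoding matters: an encoded "%26" inside a value decodes to "&" only after the pairs have been separated, so it cannot be mistaken for a separator.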
If the string contains a "%" that is not followed by two hexadecimal digits, or if it decodes into an invalid sequence, this method may throw an IllegalArgumentException. Then again, it may not: this depends on the runtime environment, and what happens when an invalid sequence is detected but no IllegalArgumentException is thrown is undefined. In Sun's JDK 1.4, no exception is thrown; instead, some meaningless bytes are added to the string that could not be decoded properly. This is a real headache and possibly a security vulnerability.
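Since the exception cannot be relied on everywhere, one defensive option is to validate the escapes before decoding. The sketch below is one possible approach with hypothetical names; it only checks that every "%" is followed by two hexadecimal digits, and does not detect escapes that decode into bytes that are invalid in the chosen charset.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class StrictDecodeSketch {
    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    // Returns true only if every '%' is followed by two hexadecimal digits.
    static boolean hasWellFormedEscapes(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) != '%') {
                continue;
            }
            if (i + 2 >= s.length() || !isHex(s.charAt(i + 1)) || !isHex(s.charAt(i + 2))) {
                return false;
            }
            i += 2; // skip the two digits just checked
        }
        return true;
    }

    // Rejects malformed input up front instead of depending on the runtime.
    static String strictDecode(String s) {
        if (!hasWellFormedEscapes(s)) {
            throw new IllegalArgumentException("Malformed % escape in: " + s);
        }
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(strictDecode("100%25+sure")); // 100% sure
        System.out.println(hasWellFormedEscapes("100%"));  // false
        System.out.println(hasWellFormedEscapes("%zz"));   // false
    }
}
```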
Because this method leaves non-escaped characters alone, you can pass it an entire URL rather than splitting the URL into pieces first. For example:
String input = "http://www.altavista.com/cgi-bin/" + "query?pg=q&kl=XX&stype=stext&q=%2B%22Java+I%2FO%22&search.x=38&search.y=3";
try {
    String output = URLDecoder.decode(input, "UTF-8");
    System.out.println(output);
}