Escape and unescape encoding converts a character to its hexadecimal Unicode value, prefixed with a % character so that it can be identified.
No further explanation here; see http://www.jb51.net/article/23657.htm.
It was originally a JS function and was later ported to Java. For a specific reference see http://blog.sina.com.cn/s/blog_4bb52a160100d9tm.html; it is among the code that programmers copy and paste most often.
First, look at the escape source code:
/**
 * Java implementation of the JS front-end escape() function.
 *
 * @param src the string to encode
 * @return the escaped string
 */
public static String escape(String src) {
    int i;
    char j;
    StringBuffer tmp = new StringBuffer();
    tmp.ensureCapacity(src.length() * 6);
    for (i = 0; i < src.length(); i++) {
        j = src.charAt(i); // read the character at index i
        if (Character.isDigit(j) || Character.isLowerCase(j) || Character.isUpperCase(j))
            tmp.append(j); // 1. A digit or a letter is used directly
        else if (j < 256) {
            tmp.append("%"); // 2. Values in [0-255] get a % prefix
            if (j < 16)
                tmp.append("0"); // 3. Values below 16 are zero-padded so the code is 2 hex digits wide
            tmp.append(Integer.toString(j, 16));
        } else {
            tmp.append("%u");
            tmp.append(Integer.toString(j, 16)); // 4. All other values are prefixed with %u
        }
    }
    return tmp.toString();
}
Next, look at the unescape method:
public static String unescape(String src) {
    StringBuffer tmp = new StringBuffer();
    tmp.ensureCapacity(src.length());
    int lastPos = 0, pos = 0;
    char ch;
    while (lastPos < src.length()) {
        pos = src.indexOf("%", lastPos); // find the next % sign
        if (pos == lastPos) {
            if (src.charAt(pos + 1) == 'u') {
                // 5. When %u is encountered, the following 4 hex digits are read and decoded
                ch = (char) Integer.parseInt(src.substring(pos + 2, pos + 6), 16);
                tmp.append(ch);
                lastPos = pos + 6;
            } else {
                // 6. For any other %, 2 hex digits [0-255] are read and decoded
                ch = (char) Integer.parseInt(src.substring(pos + 1, pos + 3), 16);
                tmp.append(ch);
                lastPos = pos + 3;
            }
        } else {
            if (pos == -1) {
                tmp.append(src.substring(lastPos));
                lastPos = src.length();
            } else {
                tmp.append(src.substring(lastPos, pos));
                lastPos = pos;
            }
        }
    }
    return tmp.toString();
}
The code logic is simple: it parses 2-digit codes [0-255] and 4-digit codes [4096-65535], respectively.
But this raises 2 questions: do characters with 3-digit codes [256-4095] not exist? And do characters with codes wider than 4 digits exist? If either kind exists, this code has a serious bug that can cause parsing to fail.
Let's start with the first question:
The Unicode values of East Asian and most other scripts are 4 digits wide in hexadecimal, but that does not mean 3-digit characters do not exist. For example, the Hindi word for yoga as given in Baidu Encyclopedia, योग, consists of 3 characters, each of which is 3 digits wide in hexadecimal. escape turns it into %u92f%u94b%u917, and the unescape code above fails on this kind of input: after %u it always reads 4 characters, picks up "92f%", and Integer.parseInt throws a NumberFormatException.
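The failure can be reproduced with a minimal sketch. The class name EscapeRoundTrip is hypothetical, and the two methods are condensed copies of the code above (StringBuilder is swapped in for StringBuffer, and the unpadded %u output is kept on purpose).

```java
final class EscapeRoundTrip {
    // Condensed copy of the escape method above; %u values are NOT zero-padded.
    static String escape(String src) {
        StringBuilder tmp = new StringBuilder(src.length() * 6);
        for (int i = 0; i < src.length(); i++) {
            char j = src.charAt(i);
            if (Character.isDigit(j) || Character.isLowerCase(j) || Character.isUpperCase(j)) {
                tmp.append(j);
            } else if (j < 256) {
                tmp.append('%');
                if (j < 16) tmp.append('0');
                tmp.append(Integer.toString(j, 16));
            } else {
                tmp.append("%u").append(Integer.toString(j, 16));
            }
        }
        return tmp.toString();
    }

    // Condensed copy of the unescape method above; always reads 4 digits after %u.
    static String unescape(String src) {
        StringBuilder tmp = new StringBuilder(src.length());
        int lastPos = 0;
        while (lastPos < src.length()) {
            int pos = src.indexOf('%', lastPos);
            if (pos == lastPos) {
                if (src.charAt(pos + 1) == 'u') {
                    tmp.append((char) Integer.parseInt(src.substring(pos + 2, pos + 6), 16));
                    lastPos = pos + 6;
                } else {
                    tmp.append((char) Integer.parseInt(src.substring(pos + 1, pos + 3), 16));
                    lastPos = pos + 3;
                }
            } else if (pos == -1) {
                tmp.append(src.substring(lastPos));
                lastPos = src.length();
            } else {
                tmp.append(src, lastPos, pos);
                lastPos = pos;
            }
        }
        return tmp.toString();
    }

    public static void main(String[] args) {
        String yoga = "\u092f\u094b\u0917"; // the three Devanagari characters from the example
        String encoded = escape(yoga);
        System.out.println(encoded); // three groups of 3 hex digits each
        try {
            unescape(encoded);
        } catch (NumberFormatException e) {
            // The 4-character read after %u swallows the next % sign.
            System.out.println("unescape failed: " + e.getMessage());
        }
    }
}
```

Input whose characters all need 4 hex digits, such as common Chinese text, round-trips without error, which is probably why the problem goes unnoticed.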
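The workaround is to guarantee that every generated character code above 255 is 4 digits wide.
The code at note 4 is modified to:
if (j < 4096)
    tmp.append("0"); // values in [256-4095] are zero-padded to 4 hex digits
tmp.append(Integer.toString(j, 16)); // 4. All other values are prefixed with %u
Or:
tmp.append(String.format("%04x", (int) j)); // the cast is required: %x does not accept a char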
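As a sketch of the fix, the two variants can be compared side by side. The names PaddedEscape, padManually and padWithFormat are hypothetical, and only the %u branch is shown. One detail to watch in the second variant: the %x conversion of java.util.Formatter does not accept a char argument and throws IllegalFormatConversionException, so the value is cast to int first.

```java
final class PaddedEscape {
    // Variant 1: prepend one "0" when the value has only 3 hex digits.
    // Assumes c >= 256, as in the %u branch of escape.
    static String padManually(char c) {
        StringBuilder sb = new StringBuilder("%u");
        if (c < 4096) sb.append('0');
        return sb.append(Integer.toString(c, 16)).toString();
    }

    // Variant 2: let the formatter zero-pad to 4 hex digits.
    static String padWithFormat(char c) {
        return "%u" + String.format("%04x", (int) c);
    }

    public static void main(String[] args) {
        // Three 3-digit characters and one 4-digit character.
        for (char c : "\u092f\u094b\u0917\u4e2d".toCharArray()) {
            System.out.println(padManually(c) + " " + padWithFormat(c));
        }
    }
}
```

Both variants give %u092f for the first character, so the existing unescape, which always reads 4 digits after %u, can decode the result without changes.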
Second question:
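Four hexadecimal digits represent 2 bytes. Unicode itself is not limited to 2 bytes (code points go up to U+10FFFF), but a Java char is a 16-bit UTF-16 code unit, and this code processes the string one char at a time. A character above U+FFFF is stored as a surrogate pair of two chars, and each of them is encoded as its own %u group, so no single value is ever wider than 4 digits and the code works. It would only become a problem if strings were stored in units wider than 16 bits, as in a UCS-4 style representation, which is unlikely to matter for most current scenarios.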
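To illustrate why 4 digits are enough when working one char at a time, here is a small sketch. It assumes the zero-padded %u form from the workaround; SurrogateDemo and escapeUnits are hypothetical names, and U+1F600 is just an arbitrary code point above U+FFFF.

```java
final class SurrogateDemo {
    // Encodes every 16-bit char of the input as a zero-padded %u group.
    static String escapeUnits(String src) {
        StringBuilder sb = new StringBuilder(src.length() * 6);
        for (int i = 0; i < src.length(); i++) {
            sb.append("%u").append(String.format("%04x", (int) src.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One code point above U+FFFF, stored by Java as two chars.
        String s = new String(Character.toChars(0x1F600));
        System.out.println(s.length());                      // count of chars
        System.out.println(s.codePointCount(0, s.length())); // count of code points
        System.out.println(escapeUnits(s));
    }
}
```

The single code point occupies two chars and comes out as %ud83d%ude00, two ordinary 4-digit groups that the existing unescape reads back as the same surrogate pair.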
A bug in the Java escape and unescape methods commonly found online