In many cases, content extracted from a web page contains special escaped characters, written either as entity names or as numeric entity codes (entity IDs), such as:
| Display | Description | Entity name | Entity ID |
|  | En space | &ensp; | &#8194; |
|  | Em space | &emsp; | &#8195; |
|  | Non-breaking space | &nbsp; | &#160; |
| < | Less than | &lt; | &#60; |
| > | Greater than | &gt; | &#62; |
| & | Ampersand | &amp; | &#38; |
| " | Double quotation mark | &quot; | &#34; |
| © | Copyright | &copy; | &#169; |
| ® | Registered trademark | &reg; | &#174; |
| ™ | Trademark (USA) | &trade; | &#8482; |
| × | Multiplication sign | &times; | &#215; |
| ÷ | Division sign | &divide; | &#247; |
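Each entity ID in the table above is the decimal Unicode code point of the character it displays. As a rough illustration only (the class name `EntityIdDemo` and the method `toNumericEntity` are hypothetical and not part of the original post), this sketch builds the numeric form from a character:

```java
public class EntityIdDemo {

    // Build the numeric entity for a code point, e.g. 169 -> "&#169;".
    public static String toNumericEntity(int codePoint) {
        return "&#" + codePoint + ";";
    }

    public static void main(String[] args) {
        // < > & " followed by the copyright, registered, trademark,
        // multiplication and division signs from the table.
        String sample = "<>&\"\u00A9\u00AE\u2122\u00D7\u00F7";
        // Walk the string by code point rather than by char.
        sample.codePoints().forEach(cp ->
                System.out.println(new String(Character.toChars(cp))
                        + " -> " + toNumericEntity(cp)));
    }
}
```

Decoding is the reverse step: parse the digits and convert the resulting code point back to a character.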
Others have already written code that converts entity encodings into printable characters: see "Java's html url character encoding into a Java string function | Chinese flex example". The principle is to obtain the character from the number in the entity encoding. However, that method does not convert entity names into printable characters: it turns "&#34;" into a double quotation mark, but it does not recognize "&quot;". Entity names follow no fixed rule, so you have to map them yourself. As a compromise, the approach here converts only the following common entity names:
| &gt; | > |
| &quot; | " |
| &nbsp; | (space) |
| &apos; | ' |
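A hand-built mapping like this could also be kept in a lookup table. The following is a minimal sketch under stated assumptions, not the original author's code: the names `NamedEntityDemo` and `decodeNamed` are hypothetical, "&amp;" and "&lt;" are added to the four names listed above, and "&nbsp;" is mapped to a plain space.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedEntityDemo {

    // Hand-built mapping; extend it with whatever names your pages use.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", " "); // assumption: a plain space is good enough
        NAMED.put("apos", "'");
    }

    // Matches "&name;" and captures the name.
    private static final Pattern NAME = Pattern.compile("&([A-Za-z]+);");

    public static String decodeNamed(String input) {
        Matcher m = NAME.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Unknown names are left exactly as they appeared in the input.
            String replacement = NAMED.getOrDefault(m.group(1), m.group());
            // quoteReplacement keeps "$" and "\" from being read as group references.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeNamed("&quot;a &lt; b&quot; &amp; &unknown;"));
    }
}
```

Because every match is looked up once in a single pass, text such as "&amp;lt;" comes out as "&lt;" and is not unescaped a second time.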
The modified code is as follows:
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * Convert HTML character entities (Unicode) to part of a Java String
     */
    public class UnicodeCeToJavaString {

        static final String mbs = "&#(\\d+);"; // like "&#12525;"

        public static String EncodeCesToChars(String paramStr) {
            // Map the common entity names by hand. "&amp;" is handled last
            // so that text such as "&amp;lt;" is not unescaped twice.
            paramStr = paramStr.replace("&lt;", "<")
                    .replace("&gt;", ">")
                    .replace("&quot;", "\"")
                    .replace("&nbsp;", " ")
                    .replace("&apos;", "'");

            StringBuffer sb = new StringBuffer();
            Pattern pat = Pattern.compile(mbs);
            Matcher mat = pat.matcher(paramStr);
            while (mat.find()) {
                String mbChar = getMbCharStr(mat.group(1)); // pass the digit part
                // quoteReplacement keeps "$" and "\" from being read as group references
                mat.appendReplacement(sb, Matcher.quoteReplacement(mbChar));
            }
            mat.appendTail(sb);
            return sb.toString().replace("&amp;", "&");
        }

        /* worker method */
        static String getMbCharStr(String digits) { // handle the "12525" part, a Unicode value as a string
            try {
                int val = Integer.parseInt(digits);
                // toChars also covers code points above U+FFFF
                return new String(Character.toChars(val)); // easy, because Java uses Unicode
            } catch (IllegalArgumentException e) {
                // not a valid number or code point: keep the entity unchanged
                System.err.println("Error from getMbCharStr: " + digits);
                return "&#" + digits + ";";
            }
        }

        public static void main(String[] args) {
            System.out.println(UnicodeCeToJavaString.EncodeCesToChars("George&#39;s War in North America"));
        }
    }
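The pattern "&#(\\d+);" used above matches only decimal entity IDs. HTML also allows a hexadecimal form such as "&#x27;". The following sketch is an extension that is not in the original code; the names `HexEntityDemo` and `decodeNumeric` are hypothetical. It accepts both forms and leaves invalid values untouched:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HexEntityDemo {

    // Group 1: hexadecimal digits after "&#x"; group 2: decimal digits after "&#".
    private static final Pattern NUMERIC =
            Pattern.compile("&#(?:[xX]([0-9A-Fa-f]+)|(\\d+));");

    public static String decodeNumeric(String input) {
        Matcher m = NUMERIC.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement;
            try {
                int codePoint = (m.group(1) != null)
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
                replacement = new String(Character.toChars(codePoint));
            } catch (IllegalArgumentException e) {
                // Overflowing or out-of-range values: leave the entity untouched.
                replacement = m.group();
            }
            // quoteReplacement keeps a decoded "$" or "\" literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeNumeric("George&#39;s caf&#xE9; costs &#36;5"));
    }
}
```

The "&#36;" in the sample input decodes to "$", which is why the replacement is quoted before it is passed to `appendReplacement`.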
For more information about the full set of encodings, see common HTML escape characters, HTML escape characters, JavaScript escape characters, HTML escape character lists, and special HTML character comparison tables (ISO Latin-1 character set).