Convert special characters in HTML into printable characters

Source: Internet
Author: User
Tags: printable characters

In many cases, the content extracted from a web page contains special escape sequences, written either as an entity name or as an entity number (entity encoding), such as:

Display  Description            Entity name  Entity number
         En space               &ensp;       &#8194;
         Em space               &emsp;       &#8195;
         Non-breaking space     &nbsp;       &#160;
<        Less than              &lt;         &#60;
>        Greater than           &gt;         &#62;
&        Ampersand              &amp;        &#38;
"        Double quotation mark  &quot;       &#34;
©        Copyright              &copy;       &#169;
®        Registered trademark   &reg;        &#174;
™        Trademark (USA)        &trade;      &#8482;
×        Multiplication sign    &times;      &#215;
÷        Division sign          &divide;     &#247;
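
An entity number is the decimal Unicode code point of the character it stands for, which is why numeric entities can be decoded without any lookup table. The following is a minimal sketch of that relationship; the class name EntityNumberDemo and its two helper methods are made up for illustration and are not part of any library.

```java
public class EntityNumberDemo {
    // Build the numeric entity for the first code point of a string, e.g. "<" -> "&#60;".
    static String toNumericEntity(String s) {
        return "&#" + s.codePointAt(0) + ";";
    }

    // Turn an entity number back into the character it stands for, e.g. 169 -> "©".
    static String fromEntityNumber(int number) {
        return new String(Character.toChars(number));
    }

    public static void main(String[] args) {
        System.out.println(toNumericEntity("<"));       // &#60;
        System.out.println(toNumericEntity("\u00d7"));  // &#215; (multiplication sign)
        System.out.println(fromEntityNumber(8482).equals("\u2122")); // true (trademark sign)
    }
}
```

Character.toChars is used instead of a plain (char) cast so that code points above 0xFFFF would also be handled.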

Earlier code already exists for converting numeric entities into printable characters: "Java's html url character encoding into a Java string function | Chinese flex example". It works by reading the number inside the entity and producing the character with that code.

However, that method does not convert entity names into printable characters. It turns "&#34;" into a double quotation mark, but it does not recognize "&quot;". Entity names follow no fixed rule, so you can only map them yourself. A simple compromise is to convert just the following common ones:

&amp;   &
&lt;    <
&gt;    >
&quot;  "
&nbsp;  (space)
&apos;  '

The modified code is as follows:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Convert HTML character entities (Unicode) to part of a Java String
 */
public class UnicodeCeToJavaString {
    static final String mbs = "&#(\\d+);"; // like "&#12525;" (ロ)

    public static String EncodeCesToChars(String paramStr) {
        // Map the common entity names by hand. "&amp;" is handled at the very
        // end so that text such as "&amp;lt;" is not decoded twice.
        paramStr = paramStr.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&nbsp;", " ") // mapped to a plain space here
                .replace("&apos;", "'");
        StringBuffer sb = new StringBuffer();
        Pattern pat = Pattern.compile(mbs);
        Matcher mat = pat.matcher(paramStr);
        while (mat.find()) {
            String mbChar = getMbCharStr(mat.group(1)); // pass the digit part
            // quoteReplacement keeps "$" and "\" in the result literal
            mat.appendReplacement(sb, Matcher.quoteReplacement(mbChar));
        }
        mat.appendTail(sb);
        return sb.toString().replace("&amp;", "&");
    }

    /* worker method */
    static String getMbCharStr(String digits) { // handle the "12525" part, a Unicode value as a string
        try {
            int val = Integer.parseInt(digits);
            return new String(Character.toChars(val)); // easy!, because Java uses Unicode
        } catch (IllegalArgumentException e) {
            // not a valid code point: keep the original entity text
            return "&#" + digits + ";";
        }
    }

    public static void main(String[] args) {
        System.out.println(UnicodeCeToJavaString.EncodeCesToChars("George&#39;s War in North America"));
    }
}
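
One possible variation on the hand-written mapping is to keep the entity names in a lookup table and decode names and numbers in a single pass. The sketch below is only an illustration under that assumption, not the original author's code; EntityDecoderSketch, NAMED, ENTITY and decode are names invented here, and the table covers only the six common names listed earlier.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoderSketch {
    // Hand-written map of entity names; extend it with whatever names your pages use.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
        NAMED.put("nbsp", "\u00a0"); // non-breaking space; use " " if a plain space is wanted
    }

    // Matches either "&#digits;" (group 1) or "&name;" (group 2).
    private static final Pattern ENTITY =
            Pattern.compile("&(?:#(\\d+)|([A-Za-z][A-Za-z0-9]*));");

    public static String decode(String input) {
        StringBuffer out = new StringBuffer();
        Matcher m = ENTITY.matcher(input);
        while (m.find()) {
            String replacement = m.group(); // default: leave unknown entities untouched
            if (m.group(1) != null) {
                try {
                    int codePoint = Integer.parseInt(m.group(1));
                    replacement = new String(Character.toChars(codePoint));
                } catch (IllegalArgumentException e) {
                    // number too large or not a valid code point: keep the original text
                }
            } else if (NAMED.containsKey(m.group(2))) {
                replacement = NAMED.get(m.group(2));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("George&#39;s &quot;War&quot; &amp; &lt;peace&gt;"));
        System.out.println(decode("&amp;lt; stays as text"));
    }
}
```

Because every entity is replaced exactly once and the output is never scanned again, an input such as "&amp;lt;" comes out as the literal text "&lt;" rather than being decoded a second time into "<".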

For the full set of encodings, see the related references: common HTML escape characters, JavaScript escape characters, HTML escape character lists, and the special HTML character comparison table (ISO Latin-1 character set).

