<p style= "margin-top:0px; margin-bottom:0px; padding-top:0px; padding-bottom:0px; Font-family:helvetica, Tahoma, Arial, Sans-serif; font-size:14px; line-height:25.1875px; " > The following code uses regular expressions to remove HTML tags from a string: </p>
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlSpirit {
    public static String delHTMLTag(String htmlStr) {
        String regEx_script = "<script[^>]*?>[\\s\\S]*?<\\/script>"; // regular expression for script blocks
        String regEx_style = "<style[^>]*?>[\\s\\S]*?<\\/style>";    // regular expression for style blocks
        String regEx_html = "<[^>]+>";                               // regular expression for any HTML tag

        Pattern p_script = Pattern.compile(regEx_script, Pattern.CASE_INSENSITIVE);
        Matcher m_script = p_script.matcher(htmlStr);
        htmlStr = m_script.replaceAll(""); // filter out script tags and their content

        Pattern p_style = Pattern.compile(regEx_style, Pattern.CASE_INSENSITIVE);
        Matcher m_style = p_style.matcher(htmlStr);
        htmlStr = m_style.replaceAll(""); // filter out style tags and their content

        Pattern p_html = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE);
        Matcher m_html = p_html.matcher(htmlStr);
        htmlStr = m_html.replaceAll(""); // filter out the remaining HTML tags

        return htmlStr.trim(); // return the plain text
    }
}

Another method for removing HTML markup from web page content in Java:

/**
 * Removes the HTML code from a string. <br>
 * Requires well-formed input: every less-than sign must have a matching
 * greater-than sign, otherwise ordinary text between them is deleted by mistake.
 *
 * @param content the content to clean
 * @return the content with HTML tags removed
 */
public static String stripHtml(String content) {
    // Replace <p> paragraph tags (with or without attributes) with a newline
    content = content.replaceAll("<p(\\s[^>]*)?>", "\r\n");
    // Replace <br> and <br/> with a newline
    content = content.replaceAll("<br\\s*/?>", "\r\n");
    // Remove everything else between < and >
    content = content.replaceAll("\\<.*?>", "");
    // Restore HTML entities (HTMLDecoder is not a JDK class, so this call is left commented out)
    // content = HTMLDecoder.decode(content);
    return content;
}

Reference URL: http://xiejincheng.blog.51cto.com/2307724/722731
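To see how the three-step filtering behaves on real input, here is a minimal sketch. The class name HtmlStripDemo and the method stripTags are hypothetical names chosen for illustration, not part of the original code. It assumes the same three regular expressions as above, precompiled once as constants, and also shows the unbalanced-bracket caveat mentioned in the Javadoc.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative, not from the original post.
class HtmlStripDemo {
    // Same expressions as in the code above, compiled once and reused.
    private static final Pattern SCRIPT =
            Pattern.compile("<script[^>]*?>[\\s\\S]*?</script>", Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE =
            Pattern.compile("<style[^>]*?>[\\s\\S]*?</style>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Removes script blocks, then style blocks, then any remaining tags.
    static String stripTags(String html) {
        String text = SCRIPT.matcher(html).replaceAll("");
        text = STYLE.matcher(text).replaceAll("");
        text = TAG.matcher(text).replaceAll("");
        return text.trim();
    }

    public static void main(String[] args) {
        String page = "<html><head><style>p { color: red; }</style>"
                + "<SCRIPT type=\"text/javascript\">alert('hi');</SCRIPT></head>"
                + "<body><p class=\"x\">Hello, <b>world</b>!</p></body></html>";
        System.out.println(stripTags(page)); // prints "Hello, world!"

        // Unbalanced angle brackets: "< b and c >" is treated as a tag and removed.
        System.out.println(stripTags("if a < b and c > d then stop"));
    }
}
```

The second call prints "if a  d then stop": the comparison operators are read as tag delimiters, which is why the input needs balanced angle brackets (or escaped entities such as &lt;) before a regex-based approach like this is safe to use.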
Java HTML tag Removal