Use regular expressions to remove HTML tags from a string.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlSpirit {
    public static String delHTMLTag(String htmlStr) {
        String regEx_script = "<script[^>]*?>[\\s\\S]*?<\\/script>"; // regular expression for script blocks
        String regEx_style = "<style[^>]*?>[\\s\\S]*?<\\/style>"; // regular expression for style blocks
        String regEx_html = "<[^>]+>"; // regular expression for any remaining HTML tag

        Pattern p_script = Pattern.compile(regEx_script, Pattern.CASE_INSENSITIVE);
        Matcher m_script = p_script.matcher(htmlStr);
        htmlStr = m_script.replaceAll(""); // remove script tags and their contents

        Pattern p_style = Pattern.compile(regEx_style, Pattern.CASE_INSENSITIVE);
        Matcher m_style = p_style.matcher(htmlStr);
        htmlStr = m_style.replaceAll(""); // remove style tags and their contents

        Pattern p_html = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE);
        Matcher m_html = p_html.matcher(htmlStr);
        htmlStr = m_html.replaceAll(""); // remove all other HTML tags

        return htmlStr.trim(); // return the plain text
    }
}
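The class above compiles its three patterns on every call. As a minimal sketch of one possible variation (the class name `PrecompiledHtmlStripper` and the method name `strip` are made up for illustration), the same three expressions can be compiled once and reused. The order still matters: script and style blocks are removed first so that their contents do not survive as text once the generic tag rule runs.

```java
import java.util.regex.Pattern;

// Hypothetical helper: same three expressions as above, compiled once.
class PrecompiledHtmlStripper {
    private static final Pattern SCRIPT =
            Pattern.compile("<script[^>]*?>[\\s\\S]*?</script>", Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE =
            Pattern.compile("<style[^>]*?>[\\s\\S]*?</style>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static String strip(String html) {
        // 1. Drop <script> blocks together with their contents.
        String text = SCRIPT.matcher(html).replaceAll("");
        // 2. Drop <style> blocks together with their contents.
        text = STYLE.matcher(text).replaceAll("");
        // 3. Drop every remaining tag, keeping the text between tags.
        text = TAG.matcher(text).replaceAll("");
        return text.trim();
    }
}
```

Under these assumptions, `PrecompiledHtmlStripper.strip("<p>Hello, <b>world</b>!</p>")` returns `Hello, world!`.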
Another way to remove HTML tags from web page content in Java:
/**
 * Removes the HTML markup from a string. <br>
 * The input must be well formed: every less-than sign needs a matching greater-than sign,
 * otherwise the text between unpaired brackets is deleted along with the tags.
 *
 * @param content
 *            the HTML content
 * @return the content with HTML tags removed
 */
public static String stripHtml(String content) {
    // Replace <p> opening tags (with or without attributes) with a line break
    content = content.replaceAll("<p(\\s[^>]*)?>", "\r\n");
    // Replace <br> and <br/> with a line break
    content = content.replaceAll("<br\\s*/?>", "\r\n");
    // Remove everything else enclosed in <>
    content = content.replaceAll("\\<.*?>", "");
    // Restore HTML entities
    // content = HTMLDecoder.decode(content);
    return content;
}
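The entity-restoring step is commented out in the method, and `HTMLDecoder` is not a JDK class. As a rough sketch of what that step might do, the hypothetical helper below (`decodeBasicEntities` is an invented name, and it covers only five common entities) uses plain string replacement. `removeTags` repeats just the generic tag rule from the method so that the example stands on its own.

```java
// Hypothetical sketch: not the HTMLDecoder referenced in the commented-out line.
class EntityDecodeSketch {
    // The generic "remove everything in <>" rule from the method above.
    static String removeTags(String content) {
        return content.replaceAll("\\<.*?>", "");
    }

    // Decodes a handful of common entities; a real decoder would cover many more.
    static String decodeBasicEntities(String text) {
        return text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;" and not "<"
    }
}
```

Decoding has to come after tag removal, otherwise a decoded `&lt;` could be mistaken for the start of a tag. The same rule explains the warning in the Javadoc: `removeTags("if a < b and c > d")` treats `< b and c >` as a tag and returns `if a  d`.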