Removing HTML tags in Java (stripping HTML markup from web pages) with regular expressions

Source: Internet
Author: User

Reference:

http://www.cnblogs.com/newsouls/p/3995394.html

http://blog.csdn.net/he20101020/article/details/21228311

Content:

package utils;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Notes on the escapes used below:
 *   \n  line feed (U+000A)
 *   \t  horizontal tab (U+0009)
 *   \s  any whitespace character (space, tab, line feed, carriage return, ...)
 *   \r  carriage return (U+000D)
 * Created by Administrator on 2017/7/14.
 */
public class HtmlUtil {

    public static void main(String[] args) {
        // Sample markup: a Weibo user-search result card.
        String str = "<div class=\"WB_cardwrap S_bg2\">\n" +
                "<div class=\"search_feed\">\n" +
                "<div class=\"person_list_feed clearfix\">\n" +
                "<div class=\"pl_personlist\">\n" +
                "<div class=\"list_person clearfix\">\n" +
                "<div class=\"person_pic\">\n" +
                "<a target=\"_blank\" href=\"http://weibo.com/114dotcom?refer_flag=1001030201_\" title=\"114 Navigation\" suda-data=\"key=tblog_search_user&value=user_feed_1_icon\">\n" +
                "<img class=\"W_face_radius\" src=\"http://tva4.sinaimg.cn/crop.176.129.505.505.180/006acv5fgw1f0gxs4ikuyj30ne0juq47.jpg\" uid=\"5653836249\" height=\"80\" width=\"80\"/></a>\n" +
                "</div>\n" +
                "<div class=\"person_detail\">\n" +
                "<p class=\"person_name\">\n" +
                "<a class=\"W_texta W_fb\" target=\"_blank\" href=\"http://weibo.com/114dotcom?refer_flag=1001030201_\" title=\"114 Navigation\" uid=\"5653836249\" suda-data=\"key=tblog_search_user&value=user_feed_1_name\">114 Navigation</a>\n" +
                "<a target=\"_blank\" href=\"http://verified.weibo.com/verify\" title=\"Weibo organization certification\" alt=\"Weibo organization certification\" class=\"W_icon icon_approve_co\"></a>\n" +
                "</p>\n" +
                "<p class=\"person_addr\">\n" +
                "<span class=\"male m_icon\" title=\"male\"></span>\n" +
                "<span>Guangdong</span>\n" +
                "<a class=\"W_linkb\" target=\"_blank\" href=\"http://weibo.com/114dotcom?refer_flag=1001030201_\" class=\"wb_url\" suda-data=\"key=tblog_search_user&value=user_feed_1_url\">http://weibo.com/114dotcom</a></p>\n" +
                "<p class=\"person_card\">\n" +
                "<em class=\"red\">114 Network Co., Ltd.</em></p>\n" +
                "<p class=\"person_num\">\n" +
                "<span>Following\n" +
                "<a class=\"W_linkb\" href=\"http://weibo.com/5653836249/follow?refer_flag=1001030201_\" target=\"_blank\" suda-data=\"key=tblog_search_user&value=user_feed_1_num\">68</a></span>\n" +
                "<span>Fans\n" +
                "<a class=\"W_linkb\" href=\"http://weibo.com/5653836249/fans?refer_flag=1001030201_\" target=\"_blank\" suda-data=\"key=tblog_search_user&value=user_feed_1_num\">118</a></span>\n" +
                "<span>Weibo\n" +
                "<a class=\"W_linkb\" href=\"http://weibo.com/5653836249/profile?refer_flag=1001030201_\" target=\"_blank\" suda-data=\"key=tblog_search_user&value=user_feed_1_num\">7</a></span>\n" +
                "</p>\n" +
                "<div class=\"person_info\">\n" +
                "<p>Introduction: 114.com, a different kind of navigation site. Can you remember it? Fresh and minimal to the extreme. It recommends not only URLs but also the answers you need. 114.com also offers practical lookups for phone numbers, brands, celebrities, prices, and more.</p>\n" +
                "</div>\n" +
                "<p class=\"person_label\">Tags:\n" +
                "<a class=\"W_linkb\" href=\"&tag=%25e6%2596%25b0%25e9%2597%25bb%25e7%2583%25ad%25e7%2582%25b9&refer=suer_tag\" suda-data=\"key=tblog_search_user&value=user_feed_1_label\">News hotspots</a></p>\n" +
                "</div>\n" +
                "</div>\n" +
                "</div>\n" +
                "</div>\n" +
                "</div>\n" +
                "</div>\n" +
                "<div class=\"WB_cardwrap S_bg2 relative\"></div>\n" +
                "<!-- not-logged-in prompt -->\n" +
                "<div class=\"search_tips clearfix\">\n" +
                "<p class=\"tips_co\">\n" +
                "<span class=\"tips_icon icon_warn\"></span>\n" +
                "<span class=\"tips_txt\">\n" +
                "<a href=\"javascript:void(0);\" action-type=\"login\">Log in now</a> to see more results. No account yet? Hurry and\n" +
                "<a href=\"http://weibo.com/signup/signup.php?lang=zh-cn&amp;entry=weisousuo\" suda-data=\"key=tblog_search_v4.1&amp;value=nologin_reg\" target=\"_blank\">sign up for Weibo</a></span>\n" +
                "</p>\n" +
                "</div>\n" +
                "<!-- /not-logged-in prompt -->";
        System.out.println(delHTMLTag(str));
    }

    /**
     * Removes script blocks, style blocks, all remaining tags, and all whitespace.
     */
    public static String delHTMLTag(String htmlStr) {
        String regEx_script = "<script[^>]*?>[\\s\\S]*?<\\/script>"; // <script> blocks, including their content
        String regEx_style = "<style[^>]*?>[\\s\\S]*?<\\/style>";    // <style> blocks, including their content
        String regEx_html = "<[^>]+>";                               // any remaining HTML tag
        // All whitespace: spaces, tabs, carriage returns, line feeds.
        // Equivalent to the original "\\s*|\t|\r|\n", without the empty matches.
        String regEx_space = "\\s+";

        Pattern p_script = Pattern.compile(regEx_script, Pattern.CASE_INSENSITIVE);
        Matcher m_script = p_script.matcher(htmlStr);
        htmlStr = m_script.replaceAll(""); // strip script blocks

        Pattern p_style = Pattern.compile(regEx_style, Pattern.CASE_INSENSITIVE);
        Matcher m_style = p_style.matcher(htmlStr);
        htmlStr = m_style.replaceAll(""); // strip style blocks

        Pattern p_html = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE);
        Matcher m_html = p_html.matcher(htmlStr);
        htmlStr = m_html.replaceAll(""); // strip HTML tags

        // Note: this also deletes the spaces between words.
        Pattern p_space = Pattern.compile(regEx_space, Pattern.CASE_INSENSITIVE);
        Matcher m_space = p_space.matcher(htmlStr);
        htmlStr = m_space.replaceAll(""); // strip whitespace and line breaks

        return htmlStr.trim(); // the plain text
    }

    /**
     * Strips tags but keeps paragraph and line breaks as "\r\n".
     */
    public static String stripHtml(String content) {
        // Replace <p> opening tags with a line break.
        // The original pattern "<p .*?>" skipped a bare <p>; this one matches both forms.
        content = content.replaceAll("<p(\\s[^>]*)?>", "\r\n");
        // Replace <br> and <br/> with a line break
        content = content.replaceAll("<br\\s*/?>", "\r\n");
        // Remove everything else between < and >
        content = content.replaceAll("<.*?>", "");
        // Decode HTML entities
        // content = HTMLDecoder.decode(content);
        return content;
    }

    /**
     * Returns the text of an HTML fragment with tags, whitespace, and &nbsp; removed.
     */
    public static String getTextFromHtml(String htmlStr) {
        htmlStr = delHTMLTag(htmlStr);
        htmlStr = htmlStr.replaceAll("&nbsp;", "");
        // htmlStr = htmlStr.substring(0, htmlStr.indexOf(".") + 1);
        return htmlStr;
    }
}
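
The tag-stripping steps in the utility above (script blocks, then style blocks, then any remaining tags, then whitespace) can be tried on a small input. The following is a minimal sketch, not the original author's code: the class name `HtmlStripSketch` and the method `toPlainText` are hypothetical names chosen for illustration. It also makes two deliberate changes that are assumptions rather than part of the source: the patterns are precompiled as constants, and runs of whitespace are collapsed to a single space instead of being deleted, so that English words stay separated.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; HtmlStripSketch and toPlainText are hypothetical names.
class HtmlStripSketch {
    // Pattern instances are immutable, so they can be compiled once and reused.
    private static final Pattern SCRIPT =
            Pattern.compile("<script[^>]*?>[\\s\\S]*?</script>", Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE =
            Pattern.compile("<style[^>]*?>[\\s\\S]*?</style>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Removes script/style blocks and tags, then collapses whitespace runs to one space. */
    static String toPlainText(String html) {
        String text = SCRIPT.matcher(html).replaceAll("");   // drop scripts with their content
        text = STYLE.matcher(text).replaceAll("");           // drop styles with their content
        text = TAG.matcher(text).replaceAll(" ");            // each remaining tag becomes a space
        text = text.replace("&nbsp;", " ");                  // simplification: treat &nbsp; as a plain space
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<div><style>p{color:red}</style><p>Hello,&nbsp;world!</p>\n"
                + "<script>alert('x');</script><p>Second line</p></div>";
        System.out.println(toPlainText(html)); // prints: Hello, world! Second line
    }
}
```

Regular expressions of this kind only approximate HTML parsing. A `>` inside an attribute value, or an unclosed `<script>` element, will not be handled correctly, so the approach is best suited to reasonably well-formed fragments like the sample above.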
