One: Background
(1) A recent task required crawling the HTML content of web pages, which meant working with crawler-style tools such as HtmlParser. Crawling is essentially about filtering out the information you want, so filters are the core. The matches(regex) and contains(str) methods of the String class are also very useful (a short sketch follows this list).
(2) Anyone who deals with crawlers ends up analyzing all kinds of site designs and layouts. Sites with very regular designs, such as QQ Zone and Weibo, are simple to crawl (though you must simulate login when paging through results), and static sites are relatively easy. The rest are troublesome: when the content is generated by JavaScript you will cry, as with crawling the contents of a mailbox.
(3) If you crawl for sensitive information, you need a bit of regular-expression syntax.
(4) Besides the HtmlParser toolkit there are similar tools such as HtmlClient; the HtmlParser toolkit and its official API are available online.
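As a quick illustration of the String methods mentioned in item (1), here is a minimal, self-contained sketch; the sample text and patterns are invented for illustration:

public class StringMatchDemo {
    public static void main(String[] args) {
        String line = "Contact: user@example.com, phone: 138-0000-0000";

        // contains(str): simple substring test
        if (line.contains("@")) {
            System.out.println("line may contain an e-mail address");
        }

        // matches(regex): the whole string must match the pattern
        String phone = "138-0000-0000";
        if (phone.matches("\\d{3}-\\d{4}-\\d{4}")) {
            System.out.println("looks like a phone number: " + phone);
        }
    }
}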
Two: Introduction to Filters
As the name implies, a Filter sifts the parse results so that you keep only the content you need. HtmlParser defines a total of 16 different Filters in the org.htmlparser.filters package, and they fall into several categories.
(1) Judgment Filters (see the sketch after this list):
TagNameFilter ------ filters by the specified HTML tag name
HasAttributeFilter ------ filters by attribute and attribute value
HasChildFilter ------ filters by whether a node contains a matching child
HasParentFilter ------ filters by whether a node has a matching parent
HasSiblingFilter ------ filters by whether a node has a matching sibling
IsEqualFilter ------ accepts only the one node it was constructed with
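A minimal sketch of two of the judgment filters; it assumes HtmlParser 1.6 on the classpath, and the URL is a placeholder:

import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;

public class JudgmentFilterDemo {
    public static void main(String[] args) throws Exception {
        Parser parser = new Parser("http://www.example.com"); // placeholder URL
        parser.setEncoding("UTF-8");

        // TagNameFilter: keep only <div> tags
        NodeList divs = parser.extractAllNodesThatMatch(new TagNameFilter("div"));

        // HasAttributeFilter: keep only nodes carrying id="content"
        parser.reset(); // rewind so the page can be scanned again
        NodeList byId = parser.extractAllNodesThatMatch(new HasAttributeFilter("id", "content"));

        System.out.println(divs.size() + " divs, " + byId.size() + " nodes with id=\"content\"");
    }
}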
(2) Logic-operation Filters (combined in the sketch after this list):
AndFilter ------ matches nodes that satisfy two or more filter conditions at the same time
NotFilter ------ logical NOT of another filter
OrFilter ------ logical OR of two or more filters
XorFilter ------ logical XOR: matches nodes accepted by exactly one of two filters
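The logic filters wrap other filters. A sketch of combining them; again the URL, tag names, and attribute values are placeholders:

import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.NotFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;

public class LogicFilterDemo {
    public static void main(String[] args) throws Exception {
        Parser parser = new Parser("http://www.example.com"); // placeholder URL

        // AndFilter: <div> tags that also carry class="post"
        AndFilter divPosts = new AndFilter(
                new TagNameFilter("div"),
                new HasAttributeFilter("class", "post"));

        // OrFilter: either <h1> or <h2> tags
        OrFilter headings = new OrFilter(
                new TagNameFilter("h1"),
                new TagNameFilter("h2"));

        // NotFilter: everything that is not a <script> tag
        NotFilter noScript = new NotFilter(new TagNameFilter("script"));

        NodeList posts = parser.extractAllNodesThatMatch(divPosts);
        parser.reset();
        NodeList heads = parser.extractAllNodesThatMatch(headings);
        parser.reset();
        NodeList rest = parser.extractAllNodesThatMatch(noScript);
        System.out.println(posts.size() + " post divs, " + heads.size()
                + " headings, " + rest.size() + " non-script nodes");
    }
}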
(3) Other Filters (see the sketch after this list):
NodeClassFilter ------ filters nodes by their Java class (e.g. LinkTag.class)
StringFilter ------ filters text nodes containing a given string (useful for sensitive-information filtering)
LinkStringFilter ------ filters links whose URL contains a given string
LinkRegexFilter ------ filters links whose URL matches a regular expression
RegexFilter ------ filters the page's displayed text by regular expression
CssSelectorNodeFilter ------ filters nodes by CSS selector
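A sketch of the string- and regex-based filters; the URL and search strings are invented examples:

import org.htmlparser.Parser;
import org.htmlparser.filters.LinkStringFilter;
import org.htmlparser.filters.RegexFilter;
import org.htmlparser.filters.StringFilter;
import org.htmlparser.util.NodeList;

public class OtherFilterDemo {
    public static void main(String[] args) throws Exception {
        Parser parser = new Parser("http://www.example.com"); // placeholder URL
        parser.setEncoding("UTF-8");

        // StringFilter: text nodes whose content contains "download"
        NodeList texts = parser.extractAllNodesThatMatch(new StringFilter("download"));

        // LinkStringFilter: links whose URL contains "csdn"
        parser.reset();
        NodeList links = parser.extractAllNodesThatMatch(new LinkStringFilter("csdn"));

        // RegexFilter: displayed text matching a date-like pattern
        parser.reset();
        NodeList dates = parser.extractAllNodesThatMatch(new RegexFilter("\\d{4}-\\d{2}-\\d{2}"));

        System.out.println(texts.size() + " texts, " + links.size() + " links, " + dates.size() + " dates");
    }
}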
(4) All Filter classes implement the org.htmlparser.NodeFilter interface. The interface declares a single method:
boolean accept(Node node);
Each subclass implements this method to decide whether the input Node satisfies the Filter's condition, returning true if it does and false otherwise (a custom-filter sketch follows).
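Because the interface is this small, writing a custom filter is straightforward. A minimal sketch; the 100-character threshold is an arbitrary example, not anything from the library:

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;

// A custom filter: accepts only nodes whose plain text is longer than 100 characters.
public class LongTextFilter implements NodeFilter {
    public boolean accept(Node node) {
        String text = node.toPlainTextString();
        return text != null && text.trim().length() > 100;
    }
}

An instance can then be passed anywhere a NodeFilter is expected, for example to parser.extractAllNodesThatMatch(new LongTextFilter()).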
Three: Problems encountered
(1) Problem: java.io.IOException: Server returned HTTP response code: 403 for URL: http://
Explanation: When a Java program retrieves content from another site whose server forbids crawling or requires authorization, this exception is thrown. If the server requires login before the page can be viewed, you cannot crawl it this way; if the server merely blocks unrecognized clients, you can set the User-Agent header to disguise the request.
Add the User-Agent header: connection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
DigExt is a token that IE5 sends in "Allow offline reading" mode; "compatible" marks compatibility mode; "Mozilla/4.0" is a legacy identification token inherited from the Netscape era, not an actual Mozilla or Firefox version.
URL url = new URL(htmlUrl); // java.net.URL
HttpURLConnection httpUrlConnection = (HttpURLConnection) url.openConnection(); // java.net.HttpURLConnection
httpUrlConnection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
parser.setConnection(httpUrlConnection);
(2) Error message screenshot (image not reproduced here)
What is a User-Agent? The User-Agent ("UA" for short) is a special string header that lets a server identify the client's operating system and version, CPU type, browser and version, rendering engine, language, browser plugins, and so on. Some sites inspect the UA in order to serve different pages to different operating systems and browsers, which may cause a page to display incorrectly in certain browsers; it also means detection can be bypassed by disguising the UA.
(3) Code example
import java.net.HttpURLConnection;
import java.net.URL;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.Div;
import org.htmlparser.tags.ImageTag;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.tags.TitleTag;
import org.htmlparser.util.NodeList;

public class NodeFilterDemo {

    /**
     * Filter tag information in a page.
     * @param htmlUrl  the URL of the page to parse
     * @param encoding the character encoding to use
     * @param tagClass the tag class to extract: LinkTag.class for hyperlinks,
     *                 ImageTag.class for image links, etc.; tag classes live in org.htmlparser.tags
     */
    public static void nodeFilterTagClass(String htmlUrl, String encoding, Class tagClass) {
        try {
            Parser parser = new Parser();
            // Add a User-Agent header to disguise the request
            URL url = new URL(htmlUrl);
            HttpURLConnection httpUrlConnection = (HttpURLConnection) url.openConnection();
            httpUrlConnection.setRequestProperty("User-Agent",
                    "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
            parser.setConnection(httpUrlConnection);
            if (null == encoding) {
                parser.setEncoding(parser.getEncoding());
            } else {
                parser.setEncoding(encoding);
            }
            // Filter the matching tags in the page
            NodeFilter filter = new NodeClassFilter(tagClass);
            NodeList list = parser.extractAllNodesThatMatch(filter);
            for (int i = 0; i < list.size(); i++) {
                Node node = (Node) list.elementAt(i);
                System.out.println("link is: " + node.toHtml());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        // URL to crawl
        String htmlUrl = "http://blog.csdn.net/u010700335";
        // Get page <a href='xxx' [attributes]> links
        nodeFilterTagClass(htmlUrl, "UTF-8", LinkTag.class);
        // Or get page <img src='xxx'> image links
        nodeFilterTagClass(htmlUrl, "UTF-8", ImageTag.class);
        // Or get the page <title>xxxx</title> title
        nodeFilterTagClass(htmlUrl, "UTF-8", TitleTag.class);
        // Or get page <div [attribute='value']>xxx</div> blocks
        nodeFilterTagClass(htmlUrl, "UTF-8", Div.class);
    }
}
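node.toHtml() prints the whole tag. If only the attribute value is needed, the concrete tag classes expose getters; a minimal standalone sketch using LinkTag (the URL is the same blog address as above):

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

public class LinkValueDemo {
    public static void main(String[] args) throws Exception {
        Parser parser = new Parser("http://blog.csdn.net/u010700335");
        parser.setEncoding("UTF-8");
        // NodeClassFilter(LinkTag.class) guarantees every element is a LinkTag
        NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
        for (int i = 0; i < list.size(); i++) {
            LinkTag link = (LinkTag) list.elementAt(i);
            // getLink() returns just the href value, getLinkText() the anchor text
            System.out.println(link.getLink() + " -> " + link.getLinkText());
        }
    }
}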