The Way of Big Data Processing (HtmlParser Filters, Part Two)

Last Update:2014-12-25 Source: Internet

Author: User

One: Cause

(1) My recent tasks constantly require crawling the HTML content of web pages, so I have been working a lot with crawler-style tools such as HtmlParser. Crawling is nothing more than filtering out the information you want, so the filter is the core. The matches(regex) and contains(str) methods of the String class are also very useful.

(2) Anyone who deals with crawlers regularly ends up analyzing the design and layout of many websites. Sites with a very regular design, such as QQ Zone and Weibo, are simple to crawl (although you have to simulate a login if you want to page through results), and some static sites are also relatively easy to crawl. The rest are troublesome: when the information is written by JavaScript, as with mailbox contents, crawling becomes painful.

(3) If you crawl sensitive information, you need to know a little regular expression syntax.

(4) A toolkit similar to HtmlParser is HtmlClient. The HtmlParser toolkit and its official online API documentation are both available.
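
The String methods and regular expressions mentioned above can be sketched with the JDK alone. This is a minimal illustration rather than code from the original article: the class name TextChecks, its method names, the sample HTML, and the simplified e-mail pattern are all assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TextChecks {
    // Hypothetical pattern for illustration: a simplified e-mail shape,
    // not a complete address validator.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** String.matches succeeds only if the WHOLE string matches the regex. */
    static boolean isAllDigits(String s) {
        return s.matches("\\d+");
    }

    /** String.contains is a plain substring test; no regex is involved. */
    static boolean mentions(String html, String keyword) {
        return html.contains(keyword);
    }

    /** Collects every substring that looks like an e-mail address. */
    static List<String> findEmails(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<p>Contact: alice@example.com or bob@example.org, id 12345</p>";
        System.out.println(isAllDigits("12345"));
        System.out.println(isAllDigits("id 12345"));
        System.out.println(mentions(html, "Contact"));
        System.out.println(findEmails(html));
    }
}
```

Note the difference between the two approaches: matches tests the entire string, while a Matcher with find scans for occurrences anywhere in the text, which is usually what a crawler needs.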

Two: Introduction to Filters

As the name implies, a Filter filters the results to obtain the content you need. HtmlParser defines a set of different Filters in the org.htmlparser.filters package, which can be divided into several categories.
(1) Judgment Filters:
TagNameFilter ------ filters by the specified HTML tag name
HasAttributeFilter ------ filters by attribute and attribute value
HasChildFilter ------ whether the node contains a child
HasParentFilter
HasSiblingFilter
IsEqualFilter
(2) Logical operation Filters:
AndFilter ------ matches nodes that satisfy two or more filter conditions at the same time
NotFilter ------ logical NOT
OrFilter ------ logical OR
XorFilter
(3) Other Filters:
NodeClassFilter
StringFilter ------ filters displayed text by a given string, for example sensitive information
LinkStringFilter ------ filters links by a given string, for example sensitive link information
LinkRegexFilter ------ filters links by regular expression
RegexFilter ------ filters strings displayed in the HTML page by regular expression
CssSelectorNodeFilter
(4) All Filter classes implement the org.htmlparser.NodeFilter interface. This interface has only one primary method:
boolean accept(Node node); each subclass implements this method to decide whether the input Node meets the Filter's condition. It returns true if the node matches and false otherwise.
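
The single-method accept contract and the logical filters can be sketched without the HtmlParser jar. The code below does not use the library: SimpleNode and SimpleFilter are hypothetical stand-ins invented for this example, and the helper names (byTag, textContains, and, or, not, extract) are assumptions that only mimic the roles of TagNameFilter, StringFilter, AndFilter, OrFilter, and NotFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FilterSketch {
    /** Hypothetical stand-in for an HTML node: a tag name plus its text. */
    static final class SimpleNode {
        final String tag;
        final String text;

        SimpleNode(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }
    }

    /** Hypothetical stand-in mirroring the one-method accept contract. */
    interface SimpleFilter {
        boolean accept(SimpleNode node);
    }

    // Judgment-style filters.
    static SimpleFilter byTag(String name) {
        return node -> node.tag.equalsIgnoreCase(name);
    }

    static SimpleFilter textContains(String needle) {
        return node -> node.text.contains(needle);
    }

    // Logical filters built from other filters.
    static SimpleFilter and(SimpleFilter a, SimpleFilter b) {
        return node -> a.accept(node) && b.accept(node);
    }

    static SimpleFilter or(SimpleFilter a, SimpleFilter b) {
        return node -> a.accept(node) || b.accept(node);
    }

    static SimpleFilter not(SimpleFilter f) {
        return node -> !f.accept(node);
    }

    /** Keeps only the nodes that the filter accepts. */
    static List<SimpleNode> extract(List<SimpleNode> nodes, SimpleFilter filter) {
        List<SimpleNode> kept = new ArrayList<>();
        for (SimpleNode node : nodes) {
            if (filter.accept(node)) {
                kept.add(node);
            }
        }
        return kept;
    }

    static List<SimpleNode> sampleNodes() {
        return Arrays.asList(
                new SimpleNode("a", "home"),
                new SimpleNode("a", "login page"),
                new SimpleNode("div", "login form"),
                new SimpleNode("img", ""));
    }

    public static void main(String[] args) {
        List<SimpleNode> nodes = sampleNodes();
        // Links whose text mentions "login".
        System.out.println(extract(nodes, and(byTag("a"), textContains("login"))).size());
        // Everything that is not a link.
        System.out.println(extract(nodes, not(byTag("a"))).size());
    }
}
```

Because every filter exposes the same accept method, the logical filters can wrap any other filter, including other logical filters, which is what makes the categories above composable.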

Three: Problems encountered

(1) Problem: java.io.IOException: Server returned HTTP response code: 403 for URL: http://

Explanation: When a Java program retrieves content from another site, this exception is thrown if the server is configured to prohibit crawling or if access to the page requires permission. If the server requires access rights, for example a login, you cannot crawl the page this way. If the server simply prohibits crawlers, you can set the User-Agent header so that the request looks like it comes from a browser.

Add the User-Agent header: connection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
DigExt is a special token sent by IE5 in "allow offline reading" mode; compatible: compatibility mode; Mozilla/4.0: a legacy compatibility token that Internet Explorer sends, not a Firefox version.
URL url = new URL(htmlUrl); // URL is in java.net
HttpURLConnection httpUrlConnection = (HttpURLConnection) url.openConnection(); // also in java.net
httpUrlConnection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
parser.setConnection(httpUrlConnection);

(2) Error message picture

What is a user agent? The user agent, abbreviated "UA", is a special header string that lets the server identify the client's operating system and version, CPU type, browser and version, rendering engine, browser language, browser plugins, and so on. Some websites inspect the UA in order to send different pages to different operating systems and browsers, which may cause some pages to display incorrectly in a particular browser. Disguising the UA can bypass this detection.
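
The UA disguise can be tried with the JDK alone. The sketch below is illustrative rather than the article's own code: the class and method names are assumptions, and http://example.com/ is a placeholder URL. By default HttpURLConnection identifies itself with a "Java/<version>" User-Agent, which is one reason a server may reject the request.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

final class UserAgentSketch {
    // The browser-like UA string used in the article.
    static final String BROWSER_UA =
            "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)";

    /**
     * Prepares an HTTP connection that carries the given User-Agent.
     * openConnection() does not contact the server yet; the request is
     * only sent when the response is first read.
     */
    static HttpURLConnection openWithUserAgent(String pageUrl, String userAgent) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(pageUrl).openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sends the request and reports whether the server answered 403 Forbidden. */
    static boolean isForbidden(HttpURLConnection conn) {
        try {
            return conn.getResponseCode() == HttpURLConnection.HTTP_FORBIDDEN;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder URL for illustration; nothing is sent until isForbidden is called.
        HttpURLConnection conn = openWithUserAgent("http://example.com/", BROWSER_UA);
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

Checking the response code explicitly, as isForbidden does, lets a crawler distinguish a 403 from other failures before it tries to read the page body.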

(3) Code example

import java.net.HttpURLConnection;
import java.net.URL;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.Div;
import org.htmlparser.tags.ImageTag;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.tags.TitleTag;
import org.htmlparser.util.NodeList;

/**
 * Filter the tag information in a page.
 * @param htmlUrl  URL of the page to parse
 * @param encoding character encoding to use
 * @param tagClass the tag class to extract, for example LinkTag.class for hyperlinks
 *                 or ImageTag.class for image links; the class must come from
 *                 org.htmlparser.tags
 */
public static void nodeFilterTagClass(String htmlUrl, String encoding, Class tagClass) {
    try {
        Parser parser = new Parser();
        // Add a User-Agent header so the request looks like it comes from a browser
        URL url = new URL(htmlUrl);
        HttpURLConnection httpUrlConnection = (HttpURLConnection) url.openConnection();
        httpUrlConnection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
        // setConnection already points the parser at this page. The original also called
        // parser.setURL(htmlUrl) here, which opens a fresh connection and drops the header.
        parser.setConnection(httpUrlConnection);
        if (null == encoding) {
            parser.setEncoding(parser.getEncoding());
        } else {
            parser.setEncoding(encoding);
        }
        // Filter the requested tags in the page
        NodeFilter filter = new NodeClassFilter(tagClass);
        NodeList list = parser.extractAllNodesThatMatch(filter);
        for (int i = 0; i < list.size(); i++) {
            Node node = (Node) list.elementAt(i);
            System.out.println("link is: " + node.toHtml());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public static void main(String[] args) {
    // URL to crawl
    String htmlUrl = "http://blog.csdn.net/u010700335";
    // Get links in <a href='xxx' [attribute]> format
    nodeFilterTagClass(htmlUrl, "UTF-8", LinkTag.class);
    // Or get the image links in the page (<img> tags)
    nodeFilterTagClass(htmlUrl, "UTF-8", ImageTag.class);
    // Or get the page title, <title>xxxx</title>
    nodeFilterTagClass(htmlUrl, "UTF-8", TitleTag.class);
    // Get <div [property='property value']>xxx</div> elements
    nodeFilterTagClass(htmlUrl, "UTF-8", Div.class);
}
