Use Webpasser to crawl the entire content of a joke website

Source: Internet
Author: User
Tags: http, cookie

Use the Webpasser framework to crawl the entire site content of a joke website. Webpasser is a configurable crawler framework with a built-in page-parsing engine that lets you set up a crawl task quickly. Its configuration separates page parsing from data storage, so the crawler can be fixed quickly when the target site is redesigned.

The configuration instructions are as follows (see http://git.oschina.net/passer/webpasser for the complete configuration of this example):

1. Set the global fetch parameters: page encoding is GBK, the request timeout is 5 seconds, a failed request is retried 5 times, the crawler waits 10 seconds after a failure, 10 crawl threads are used, and there is no delay between fetches. Request headers, cookies, and the proxy are left unset here.

<fetchConfig charset="GBK" timeoutSecond="5" errorRetry="5" errorDelayTime="10" runThreadNum="10" fetchPrepareDelayTime="0">
    <userAgent>mozilla/5.0 (compatible; webpasser;)</userAgent>
    <!-- HTTP request headers -->
    <headers>
    </headers>
    <!-- HTTP cookies -->
    <cookies>
        <!-- <cookie name="cookie1" value="" host="" path="" />
             <cookie name="cookie2" value="1" /> -->
    </cookies>
    <!-- proxy settings: read IPs in bulk from ip.txt; each fetch uses a randomly chosen IP -->
    <!-- <proxies path="ip.txt"></proxies> -->
    <!-- single-proxy mode: pollUrl is a link that returns a random proxy IP in ip:port format
         (when this is used, the proxies tag is ignored) -->
    <!-- <proxy pollUrl="http://localhost:8083/proxyManage/pollProxyIp.action?task=xunbo"></proxy> -->
</fetchConfig>
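
To make these knobs concrete, here is a plain-Java sketch that reproduces the same behavior by hand: a 5-second timeout, up to 5 retries, a 10-second pause after a failure, and GBK decoding. This is only an illustration, not Webpasser code; the class and method names are ours.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Plain-Java sketch of the fetchConfig behavior above (not Webpasser code).
public class FetchSketch {
    public static void main(String[] args) throws Exception {
        System.out.println(fetch("http://www.jokeji.cn/"));
    }

    static String fetch(String url) throws Exception {
        for (int attempt = 0; attempt <= 5; attempt++) {        // errorRetry="5": 1 try + 5 retries
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(url).openConnection();
                conn.setConnectTimeout(5000);                    // timeoutSecond="5"
                conn.setReadTimeout(5000);
                conn.setRequestProperty("User-Agent",
                        "mozilla/5.0 (compatible; webpasser;)"); // <userAgent> tag
                try (InputStream in = conn.getInputStream()) {
                    return new String(in.readAllBytes(), "GBK"); // charset="GBK"
                }
            } catch (Exception e) {
                Thread.sleep(10_000);                            // errorDelayTime="10"
            }
        }
        throw new Exception("giving up after retries: " + url);
    }
}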

2. scope restricts crawling to the listed domain names (note that limitHost must be a bare domain name, with no http:// prefix or trailing /), and seeds lists the seed URLs from which crawling starts.

<scope>
    <limitHost value="www.jokeji.cn" />
</scope>
<!-- seeds -->
<seeds>
    <seed url="http://www.jokeji.cn/" />
</seeds>
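
The bare-domain requirement can be pictured in plain Java: limitHost is compared against the host component of each discovered URL, which has no scheme or path in it. This is our illustration of the idea, not Webpasser internals.

import java.net.URI;

// Why limitHost must be a bare domain: it is matched against the
// host part of each URL (illustrative link, not a real page).
public class ScopeSketch {
    public static void main(String[] args) {
        String limitHost = "www.jokeji.cn";
        String link = "http://www.jokeji.cn/jokehtml/example.htm";
        boolean inScope = URI.create(link).getHost().equals(limitHost);
        System.out.println(inScope); // true
    }
}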

3. If a page's URL matches the scope rules, the page enters this parsing policy. digLink is used to dig new links out of a page: a jsoup rule (Jsoup syntax) selects the href of every a tag, and because those are relative links, the relativeToFullUrl processing chain converts them to absolute links. For example, /jokehtml/x.htm found on http://www.jokeji.cn/ becomes http://www.jokeji.cn/jokehtml/x.htm.

<page>
    <scope>
        <rule type="regex" value="http://www.jokeji.cn(.*)" />
    </scope>
    <!-- dig up new links -->
    <digLink>
        <rules>
            <rule type="jsoup" value="a[href]" attr="href" />
            <rule type="relativeToFullUrl" />
        </rules>
    </digLink>
</page>
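
The two rules correspond closely to what the Jsoup library itself does. The sketch below uses Jsoup directly, outside Webpasser, to show the dig-and-absolutize step; it assumes the org.jsoup dependency, and the class name is ours.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Standalone Jsoup sketch of the digLink rules above (not Webpasser code):
// take href from every <a> tag, then make relative links absolute.
public class DigLinkSketch {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.jokeji.cn/").get();
        for (Element a : doc.select("a[href]")) {
            // "abs:href" resolves the relative href against the page URL,
            // mirroring the relativeToFullUrl processing chain.
            System.out.println(a.attr("abs:href"));
        }
    }
}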

4. Extract the specific business data you want from the detail page; after parsing, the specified map data is returned (here, a map with a title field and a content field).

(1) Get the title data: the contents of the title tag are obtained with a jsoup rule (jQuery-style selector), and the cut rule (an interception processing chain) drops everything from the "_" onward, because that part of the title is not needed. For example, a title of the form "some joke_sitename" is trimmed to "some joke".

(2) Get the article content data: the contents of the tag with id text110 are obtained with a jsoup rule (jQuery-style selector).

<!-- parse the specific business data; the result after processing is a map -->
<page>
    <scope>
        <rule type="regex" value="http://www.jokeji.cn/jokehtml/(.*).htm" />
    </scope>
    <field name="title">
        <!-- the processing chain that extracts one field's data -->
        <rules>
            <rule type="jsoup" value="title" exp="html()" />
            <rule type="cut">
                <pre></pre>
                <end>_</end>
            </rule>
        </rules>
    </field>
    <field name="content">
        <rules>
            <rule type="jsoup" value="#text110" />
        </rules>
    </field>
</page>
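
For comparison, the same two extractions expressed directly against the Jsoup library look roughly like this. This is an illustrative approximation of what the rules do, not Webpasser code; the URL and class name are placeholders.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.HashMap;
import java.util.Map;

// Standalone Jsoup sketch of the two field rules above (not Webpasser code):
// title = inner HTML of <title>, cut at the first "_"; content = element #text110.
public class FieldSketch {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect(
                "http://www.jokeji.cn/jokehtml/example.htm").get(); // illustrative URL

        String title = doc.select("title").html();  // rule type="jsoup" value="title" exp="html()"
        int cut = title.indexOf('_');
        if (cut >= 0) {
            title = title.substring(0, cut);        // rule type="cut" with <end>_</end>
        }
        String content = doc.select("#text110").html(); // rule type="jsoup" value="#text110"

        Map<String, String> result = new HashMap<>();   // parsing yields a map, as in step 4
        result.put("title", title);
        result.put("content", content);
        System.out.println(result);
    }
}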

5. Persistence configuration for the parsed data (stores the data from step 4): target has the fixed value handleResultMapInterface and marks this as a persistence class, while classpath names the concrete implementation class. com.hxt.webpasser.persistent.impl.DiskJsonHandleResult is an implementation that stores to disk; it takes the properties rootDir (which folder to save into) and charSet (the encoding used when storing); see that class for details. You can also write a custom persistence class by implementing the HandleResultMapInterface interface; its properties are passed in from the configuration (a rough sketch follows the configuration below). It is recommended to make persistence a separate, independent project that exposes an HTTP interface for data storage; the crawler then pushes data to that interface, which keeps the two concerns separate and easier to maintain. See com.hxt.webpasser.persistent.impl.VideoPushServcieImpl for an example.

<!-- persist the parsed data -->
<resultHandler target="handleResultMapInterface" classpath="com.hxt.webpasser.persistent.impl.DiskJsonHandleResult">
    <property name="rootDir" value="downdir/path/jokeji"></property>
    <property name="charSet" value="GBK"></property>
</resultHandler>
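
A custom persistence class might take roughly the following shape. This is a hypothetical sketch only: the real HandleResultMapInterface lives in the Webpasser sources and its method name and signature may differ from the stand-in assumed here; only the idea of implementing the interface and receiving properties from the configuration comes from the text above.

import java.util.Map;

// HYPOTHETICAL stand-in so this sketch compiles on its own; check the
// real HandleResultMapInterface in the Webpasser sources before copying.
interface HandleResultMapInterface {
    void handleResult(Map<String, Object> resultMap);
}

public class HttpPushHandleResult implements HandleResultMapInterface {

    // Injected from a <property name="pushUrl" .../> tag, the same way
    // rootDir and charSet are injected into DiskJsonHandleResult above.
    private String pushUrl;

    public void setPushUrl(String pushUrl) {
        this.pushUrl = pushUrl;
    }

    @Override
    public void handleResult(Map<String, Object> resultMap) {
        // e.g. serialize resultMap to JSON and POST it to pushUrl, keeping
        // storage in a separate service as the text recommends.
        System.out.println("push to " + pushUrl + ": " + resultMap);
    }
}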

6. Once the configuration is written, you can add the task and start testing. It is recommended to test against a single page first.
