Use Webpasser to crawl the entire content of a joke website

Source: Internet
Author: User
Tags: http, cookie

Use the Webpasser framework to crawl the entire site content of a joke website. Webpasser is a configurable crawler framework with a built-in page-parsing engine that lets you set up a crawl task quickly. Its configuration separates page parsing from data storage, so the crawler can be fixed quickly when the target site is redesigned.

The configuration instructions are as follows (see http://git.oschina.net/passer/webpasser for the complete configuration of this example):

1. Set the global fetch parameters: page encoding is GBK, the request timeout is 5 seconds, a failed request is retried 5 times, the crawler waits 10 seconds after a failure, 10 crawl threads are used, and there is no delay between fetches. Request headers, cookies, and the proxy are left unset here.

<fetchConfig charset="GBK" timeoutSecond="5" errorRetry="5" errorDelayTime="10" runThreadNum="10" fetchPrepareDelayTime="0">
    <userAgent>mozilla/5.0 (compatible; webpasser;)</userAgent>
    <!-- HTTP request headers -->
    <headers>
    </headers>
    <!-- HTTP cookies -->
    <cookies>
        <!-- <cookie name="cookie1" value="" host="" path="" />
             <cookie name="cookie2" value="1" /> -->
    </cookies>
    <!-- proxy settings: read IPs in bulk from ip.txt; each fetch uses a randomly chosen IP -->
    <!-- <proxies path="ip.txt"></proxies> -->
    <!-- single-proxy mode: pollUrl is a link that returns a random proxy IP in ip:port format
         (when this is used, the proxies tag is ignored) -->
    <!-- <proxy pollUrl="http://localhost:8083/proxyManage/pollProxyIp.action?task=xunbo"></proxy> -->
</fetchConfig>
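
To make these knobs concrete, here is a plain-Java sketch that reproduces the same behavior by hand: a 5-second timeout, up to 5 retries, a 10-second pause after a failure, and GBK decoding. This is only an illustration, not Webpasser code; the class and method names are ours.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Plain-Java sketch of the fetchConfig behavior above (not Webpasser code).
public class FetchSketch {
    public static void main(String[] args) throws Exception {
        System.out.println(fetch("http://www.jokeji.cn/"));
    }

    static String fetch(String url) throws Exception {
        for (int attempt = 0; attempt <= 5; attempt++) {        // errorRetry="5": 1 try + 5 retries
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(url).openConnection();
                conn.setConnectTimeout(5000);                    // timeoutSecond="5"
                conn.setReadTimeout(5000);
                conn.setRequestProperty("User-Agent",
                        "mozilla/5.0 (compatible; webpasser;)"); // <userAgent> tag
                try (InputStream in = conn.getInputStream()) {
                    return new String(in.readAllBytes(), "GBK"); // charset="GBK"
                }
            } catch (Exception e) {
                Thread.sleep(10_000);                            // errorDelayTime="10"
            }
        }
        throw new Exception("giving up after retries: " + url);
    }
}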

2. scope restricts crawling to the listed domain names (note that limitHost must be a bare domain name, with no http:// prefix or trailing /), and seeds lists the seed URLs from which crawling starts.

<scope>
    <limitHost value="www.jokeji.cn" />
</scope>
<!-- seeds -->
<seeds>
    <seed url="http://www.jokeji.cn/" />
</seeds>
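
The bare-domain requirement can be pictured in plain Java: limitHost is compared against the host component of each discovered URL, which has no scheme or path in it. This is our illustration of the idea, not Webpasser internals.

import java.net.URI;

// Why limitHost must be a bare domain: it is matched against the
// host part of each URL (illustrative link, not a real page).
public class ScopeSketch {
    public static void main(String[] args) {
        String limitHost = "www.jokeji.cn";
        String link = "http://www.jokeji.cn/jokehtml/example.htm";
        boolean inScope = URI.create(link).getHost().equals(limitHost);
        System.out.println(inScope); // true
    }
}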

3. If a page's URL matches the scope rules, the page enters this parsing policy. digLink is used to dig new links out of a page: a jsoup rule (Jsoup syntax) selects the href of every a tag, and because those are relative links, the relativeToFullUrl processing chain converts them to absolute links. For example, /jokehtml/x.htm found on http://www.jokeji.cn/ becomes http://www.jokeji.cn/jokehtml/x.htm.

<page>
    <scope>
        <rule type="regex" value="http://www.jokeji.cn(.*)" />
    </scope>
    <!-- dig up new links -->
    <digLink>
        <rules>
            <rule type="jsoup" value="a[href]" attr="href" />
            <rule type="relativeToFullUrl" />
        </rules>
    </digLink>
</page>
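
The two rules correspond closely to what the Jsoup library itself does. The sketch below uses Jsoup directly, outside Webpasser, to show the dig-and-absolutize step; it assumes the org.jsoup dependency, and the class name is ours.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Standalone Jsoup sketch of the digLink rules above (not Webpasser code):
// take href from every <a> tag, then make relative links absolute.
public class DigLinkSketch {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.jokeji.cn/").get();
        for (Element a : doc.select("a[href]")) {
            // "abs:href" resolves the relative href against the page URL,
            // mirroring the relativeToFullUrl processing chain.
            System.out.println(a.attr("abs:href"));
        }
    }
}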

4. Extract the specific business data you want from the detail page; after parsing, the specified map data is returned (here, a map with a title field and a content field).

(1) Get the title data: the contents of the title tag are obtained with a jsoup rule (jQuery-style selector), and the cut rule (an interception processing chain) drops everything from the "_" onward, because that part of the title is not needed. For example, a title of the form "some joke_sitename" is trimmed to "some joke".

(2) Get the article content data: the contents of the tag with id text110 are obtained with a jsoup rule (jQuery-style selector).

<!-- parse the specific business data; the result after processing is a map -->
<page>
    <scope>
        <rule type="regex" value="http://www.jokeji.cn/jokehtml/(.*).htm" />
    </scope>
    <field name="title">
        <!-- the processing chain that extracts one field's data -->
        <rules>
            <rule type="jsoup" value="title" exp="html()" />
            <rule type="cut">
                <pre></pre>
                <end>_</end>
            </rule>
        </rules>
    </field>
    <field name="content">
        <rules>
            <rule type="jsoup" value="#text110" />
        </rules>
    </field>
</page>
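
For comparison, the same two extractions expressed directly against the Jsoup library look roughly like this. This is an illustrative approximation of what the rules do, not Webpasser code; the URL and class name are placeholders.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.HashMap;
import java.util.Map;

// Standalone Jsoup sketch of the two field rules above (not Webpasser code):
// title = inner HTML of <title>, cut at the first "_"; content = element #text110.
public class FieldSketch {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect(
                "http://www.jokeji.cn/jokehtml/example.htm").get(); // illustrative URL

        String title = doc.select("title").html();  // rule type="jsoup" value="title" exp="html()"
        int cut = title.indexOf('_');
        if (cut >= 0) {
            title = title.substring(0, cut);        // rule type="cut" with <end>_</end>
        }
        String content = doc.select("#text110").html(); // rule type="jsoup" value="#text110"

        Map<String, String> result = new HashMap<>();   // parsing yields a map, as in step 4
        result.put("title", title);
        result.put("content", content);
        System.out.println(result);
    }
}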

5. Persistence configuration for the parsed data (stores the data from step 4): target has the fixed value handleResultMapInterface and marks this as a persistence class, while classpath names the concrete implementation class. com.hxt.webpasser.persistent.impl.DiskJsonHandleResult is an implementation that stores to disk; it takes the properties rootDir (which folder to save into) and charSet (the encoding used when storing); see that class for details. You can also write a custom persistence class by implementing the HandleResultMapInterface interface; its properties are passed in from the configuration (a rough sketch follows the configuration below). It is recommended to make persistence a separate, independent project that exposes an HTTP interface for data storage; the crawler then pushes data to that interface, which keeps the two concerns separate and easier to maintain. See com.hxt.webpasser.persistent.impl.VideoPushServcieImpl for an example.

<!-- persist the parsed data -->
<resultHandler target="handleResultMapInterface" classpath="com.hxt.webpasser.persistent.impl.DiskJsonHandleResult">
    <property name="rootDir" value="downdir/path/jokeji"></property>
    <property name="charSet" value="GBK"></property>
</resultHandler>
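
A custom persistence class might take roughly the following shape. This is a hypothetical sketch only: the real HandleResultMapInterface lives in the Webpasser sources and its method name and signature may differ from the stand-in assumed here; only the idea of implementing the interface and receiving properties from the configuration comes from the text above.

import java.util.Map;

// HYPOTHETICAL stand-in so this sketch compiles on its own; check the
// real HandleResultMapInterface in the Webpasser sources before copying.
interface HandleResultMapInterface {
    void handleResult(Map<String, Object> resultMap);
}

public class HttpPushHandleResult implements HandleResultMapInterface {

    // Injected from a <property name="pushUrl" .../> tag, the same way
    // rootDir and charSet are injected into DiskJsonHandleResult above.
    private String pushUrl;

    public void setPushUrl(String pushUrl) {
        this.pushUrl = pushUrl;
    }

    @Override
    public void handleResult(Map<String, Object> resultMap) {
        // e.g. serialize resultMap to JSON and POST it to pushUrl, keeping
        // storage in a separate service as the text recommends.
        System.out.println("push to " + pushUrl + ": " + resultMap);
    }
}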

6. Once the configuration is written, you can add the task and start testing. It is recommended to test against a single page first.
