Capturing Naruto comics with Web-Harvest

Source: Internet
Author: User

Do you find that Naruto updates are too slow? Do the comic sites you use not let you download chapters? Then take a look at this ^_^

PS: Web-Harvest: http://web-harvest.sourceforge.net



1. Logic file

 

<?xml version="1.0" encoding="UTF-8"?>
<config>
    <include path="functions.xml"/>

    <var-def name="num" overwrite="false">1</var-def>
    <loop index="i" item="url">
        <!-- Get the list of names -->
        <list>
            <var-def name="imagelinks">
                <call name="download-multipage-list">
                    <call-param name="pageurl"><template>http://www.narutom.com/comic/index.html</template></call-param>
                    <call-param name="nextxpath">//div[@class='pagenav']/a[last()-1]/@href</call-param>
                    <call-param name="itemxpath">//div[@id='dm_name']/ul/li/a/text()</call-param>
                    <call-param name="maxloops"><template>${num}</template></call-param>
                </call>
            </var-def>
        </list>
        <body>
            <empty>
                <!-- Get ordinal -->
                <var-def name="ordinal">
                    <regexp>
                        <regexp-pattern>^\D*(\d*)?\D*$</regexp-pattern>
                        <regexp-source><template>${url}</template></regexp-source>
                        <regexp-result>
                            <template>${_1}</template>
                        </regexp-result>
                    </regexp>
                </var-def>
                <!-- Output -->
                <call name="getcomic">
                    <call-param name="fromnum"><template>${ordinal}</template></call-param>
                    <call-param name="directory"><template>${url}</template></call-param>
                </call>
            </empty>
        </body>
    </loop>
</config>
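To make the "Get ordinal" step easier to follow, here is a minimal Python sketch of the same regular expression. It assumes the pattern is `^\D*(\d*)?\D*$` (the backslashes are a reconstruction, since the listing shows them as slashes). The helper name `extract_ordinal` and the sample titles are hypothetical and only serve as illustration; they are not part of Web-Harvest.

```python
import re

# Assumed reconstruction of the <regexp-pattern> value:
# leading non-digits, an optional run of digits (group 1), trailing non-digits.
ORDINAL_PATTERN = re.compile(r"^\D*(\d*)?\D*$")


def extract_ordinal(title: str) -> str:
    """Return the single run of digits in `title`.

    Returns "" when the title has no digits, or when it contains more than
    one separate run of digits (the pattern then does not match at all).
    """
    match = ORDINAL_PATTERN.match(title)
    if match is None:
        return ""
    return match.group(1) or ""


if __name__ == "__main__":
    # Hypothetical list entries, for illustration only.
    for sample in ("Naruto 431", "no digits", "12 and 34"):
        print(repr(sample), "->", repr(extract_ordinal(sample)))
```

Note that an entry with two separate numbers yields an empty ordinal under this pattern, which may be one reason some entries are not extracted.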

 

 

 

2. Function library file (functions.xml)

 

 

<?xml version="1.0" encoding="UTF-8"?>
<config>
    <!--
        Download multi-page list of items.

        @param pageurl   - URL of starting page
        @param itemxpath - XPath expression to obtain single item in the list
        @param nextxpath - XPath expression to URL for the next page
        @param maxloops  - maximum number of pages downloaded

        @return list of all downloaded items
    -->
    <function name="download-multipage-list">
        <return>
            <while condition="${pageurl.toString().length() != 0}" maxloops="${maxloops}" index="i">
                <empty>
                    <var-def name="content">
                        <html-to-xml>
                            <http url="${pageurl}" charset="gb2312"/>
                        </html-to-xml>
                    </var-def>
                    <var-def name="nextlinkurl">
                        <xpath expression="${nextxpath}">
                            <var name="content"/>
                        </xpath>
                    </var-def>
                    <var-def name="pageurl">
                        <!-- <template>${sys.fullUrl(pageurl.toString(), nextlinkurl.toString())}</template> -->
                        <template>${nextlinkurl.toString()}</template>
                    </var-def>
                </empty>

                <xpath expression="${itemxpath}">
                    <var name="content"/>
                </xpath>
            </while>
        </return>
    </function>

    <!-- Naruto -->
    <function name="getcomic">
        <while index="j" condition="${j.toInt() != 20}">
            <var-def name="pageurl">
                <template>watermark</template>
            </var-def>
            <file action="write" path='/home/xyzqing/webharvest/Naruto/${directory}/canonical fig' type="binary">
                <http url="${pageurl}"/>
            </file>
        </while>
    </function>
</config>
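The control flow of the multi-page download function can be sketched in Python roughly as follows. This is only an illustration of the loop, not a replacement for the Web-Harvest function: `read_page` is a hypothetical stand-in for fetching a page and applying the item and next-link XPath expressions, and `FAKE_SITE` is made-up data so the sketch runs without network access.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for "fetch page, convert to XML, run both XPaths":
# given a page URL it returns (items on that page, URL of the next page or "").
PageReader = Callable[[str], Tuple[List[str], str]]


def download_multipage_list(page_url: str, read_page: PageReader, max_loops: int) -> List[str]:
    """Collect items page by page until the next link is empty or max_loops is reached."""
    items: List[str] = []
    loops = 0
    while page_url and loops < max_loops:
        page_items, next_url = read_page(page_url)
        items.extend(page_items)
        page_url = next_url  # an empty next link ends the loop
        loops += 1
    return items


# Made-up two-page site, for illustration only.
FAKE_SITE: Dict[str, Tuple[List[str], str]] = {
    "page1": (["item A", "item B"], "page2"),
    "page2": (["item C"], ""),
}


def read_fake_page(url: str) -> Tuple[List[str], str]:
    return FAKE_SITE[url]


if __name__ == "__main__":
    print(download_multipage_list("page1", read_fake_page, max_loops=10))
```

With `max_loops=1`, as in the logic file where `num` is 1, only the first page of the list is read. The sketch also uses the next link as-is, so a site with relative links would need them resolved against the current page URL first.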

 

 

3. Result

PS: A few useless images may be downloaded. This is a technical flaw, but it does not spoil the reading experience for fans. In addition, the special chapters are not extracted because their IDs are not consecutive. You can modify the example yourself to handle them; after all, there are only a few of them.
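One possible way to weed out the useless files afterwards is to check whether each downloaded file actually starts like an image. The post does not say what the useless files contain, so the sketch below simply assumes they are non-image responses (for example an error page saved under an image name). The function names and sample data are hypothetical.

```python
from typing import Dict

# Leading bytes ("magic numbers") of common image formats.
IMAGE_SIGNATURES = (
    b"\xff\xd8\xff",       # JPEG
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"GIF87a",             # GIF (1987 variant)
    b"GIF89a",             # GIF (1989 variant)
)


def looks_like_image(data: bytes) -> bool:
    """Return True if `data` starts with a known image signature."""
    return data.startswith(IMAGE_SIGNATURES)


def keep_images(downloads: Dict[str, bytes]) -> Dict[str, bytes]:
    """Drop every entry whose content does not look like an image."""
    return {name: data for name, data in downloads.items() if looks_like_image(data)}


if __name__ == "__main__":
    # Made-up download results, for illustration only.
    sample = {
        "page-1": b"\xff\xd8\xff\xe0" + b"\x00" * 8,
        "page-2": b"<html>Not Found</html>",
    }
    print(sorted(keep_images(sample)))
```

This only filters by file signature. If the useless files are valid images (such as a placeholder picture), a different check would be needed.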

How to run: download the Web-Harvest jar package and run the "logic file" with its built-in UI. That is all.
Of course, you must configure the output path yourself.

 

 

Comments and discussion are welcome.

 

