Original address: http://www.jianshu.com/p/c3fc3129407d
1. Crawler Framework WebMagic
WebMagic is a simple and flexible crawler framework. Based on WebMagic, you can quickly develop an efficient and maintainable crawler.
1.1 Official Address
The official documentation is written quite clearly; it is recommended that you read it directly, though you can also read the content below. The addresses are as follows:
Official website: http://webmagic.io
Chinese documentation: http://webmagic.io/docs/zh/
English documentation: http://webmagic.io/docs/en
2. Integrating WebMagic with Spring Boot
Combining Spring Boot with WebMagic involves three main modules: the crawl module (Processor), which scrapes page data; the storage module (Pipeline), which writes the crawled data to the database; and the scheduled task module (Scheduled), which re-crawls the site's data at fixed intervals.
2.1 Maven Dependencies
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.5.3</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.5.3</version>
    </dependency>
2.2 Crawl Module (Processor)
The processor below crawls the Jianshu homepage: it analyzes the homepage's markup, extracts each article's link and title, and puts them into the WebMagic Page so that the storage module can pick them up and save them to the database. The code is as follows:
    package com.shang.spray.common.processor;

    import com.shang.spray.entity.News;
    import com.shang.spray.entity.Sources;
    import com.shang.spray.pipeline.NewsPipeline;
    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.processor.PageProcessor;
    import us.codecraft.webmagic.selector.Selectable;

    import java.util.List;

    /**
     * Info: Jianshu homepage crawler
     * Created by Shang on 16/9/9.
     */
    public class JianShuProcessor implements PageProcessor {

        private Site site = Site.me()
                .setDomain("jianshu.com")
                .setSleepTime(100) // request interval in ms (the value was lost in the original formatting; 100 is assumed)
                .setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36");

        public static final String list = "http://www.jianshu.com";

        @Override
        public void process(Page page) {
            if (page.getUrl().regex(list).match()) {
                List<Selectable> list = page.getHtml().xpath("//ul[@class='article-list thumbnails']/li").nodes();
                for (Selectable s : list) {
                    String title = s.xpath("//div/h4/a/text()").toString();
                    String link = s.xpath("//div/h4").links().toString();
                    News news = new News();
                    news.setTitle(title);
                    news.setInfo(title);
                    news.setLink(link);
                    news.setSources(new Sources(5));
                    page.putField("news" + title, news);
                }
            }
        }

        @Override
        public Site getSite() {
            return site;
        }

        public static void main(String[] args) {
            Spider spider = Spider.create(new JianShuProcessor());
            spider.addUrl("http://www.jianshu.com");
            spider.addPipeline(new NewsPipeline());
            spider.thread(5);
            spider.setExitWhenComplete(true);
            spider.start();
        }
    }
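The XPath selectors do the actual extraction in the processor above. As a standalone illustration of the same selector pattern (each li node, then the title text and the link inside it), here is a sketch using the JDK's built-in javax.xml.xpath API on a simplified, hypothetical HTML fragment; the markup below is an assumption for demonstration, not Jianshu's real page structure:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class XPathDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical, simplified stand-in for the article-list markup the processor targets
        String html = "<ul class=\"article-list\">"
                + "<li><div><h4><a href=\"/p/1\">First post</a></h4></div></li>"
                + "<li><div><h4><a href=\"/p/2\">Second post</a></h4></div></li>"
                + "</ul>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Select every <li> in the list, as the processor does with WebMagic's nodes()
        NodeList items = (NodeList) xpath.evaluate("//ul[@class='article-list']/li", doc, XPathConstants.NODESET);
        for (int i = 0; i < items.getLength(); i++) {
            // Relative expressions against each item, mirroring s.xpath(...) in the processor
            String title = xpath.evaluate("div/h4/a/text()", items.item(i));
            String link = xpath.evaluate("div/h4/a/@href", items.item(i));
            System.out.println(title + " -> " + link);
        }
    }
}
```

WebMagic's own selectors add conveniences such as links() on top of this idea, but the node-then-relative-expression shape is the same.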
2.3 Storage Module (Pipeline)
The storage module combines WebMagic with Spring Boot's repository layer: it implements WebMagic's Pipeline interface, retrieves the crawl module's data in the process method, and then calls the Spring Data save method. The code is as follows:
    package com.shang.spray.pipeline;

    import com.shang.spray.entity.News;
    import com.shang.spray.repository.NewsRepository;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.data.jpa.domain.Specification;
    import org.springframework.stereotype.Repository;
    import us.codecraft.webmagic.ResultItems;
    import us.codecraft.webmagic.Task;
    import us.codecraft.webmagic.pipeline.Pipeline;

    import javax.persistence.criteria.CriteriaBuilder;
    import javax.persistence.criteria.CriteriaQuery;
    import javax.persistence.criteria.Predicate;
    import javax.persistence.criteria.Root;
    import java.util.Date;
    import java.util.Map;

    /**
     * Info: News pipeline
     * Created by Shang on 16/8/22.
     */
    @Repository
    public class NewsPipeline implements Pipeline {

        @Autowired
        protected NewsRepository newsRepository;

        @Override
        public void process(ResultItems resultItems, Task task) {
            for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
                if (entry.getKey().contains("news")) {
                    News news = (News) entry.getValue();
                    Specification<News> specification = new Specification<News>() {
                        @Override
                        public Predicate toPredicate(Root<News> root, CriteriaQuery<?> criteriaQuery, CriteriaBuilder criteriaBuilder) {
                            return criteriaBuilder.and(criteriaBuilder.equal(root.get("link"), news.getLink()));
                        }
                    };
                    if (newsRepository.findOne(specification) == null) { // check whether the link already exists
                        news.setAuthor("Splash");
                        news.setTypeId(1);
                        news.setSort(1);
                        news.setStatus(1);
                        news.setExplicitLink(true);
                        news.setCreateDate(new Date());
                        news.setModifyDate(new Date());
                        newsRepository.save(news);
                    }
                }
            }
        }
    }
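The key design choice in this pipeline is the duplicate check: a News row is only saved when no existing row already has the same link, so repeated crawls of the same page do not produce duplicate records. Stripped of JPA, the logic reduces to a membership test on the link. A minimal, hypothetical in-memory sketch of that pattern (the NewsStore class below is an illustration, not part of the project):

```java
import java.util.HashMap;
import java.util.Map;

public class DedupDemo {
    // Hypothetical in-memory stand-in for the repository's find-by-link check
    static class NewsStore {
        private final Map<String, String> byLink = new HashMap<>(); // link -> title

        /** Saves only if this link has not been seen before; returns true if saved. */
        boolean saveIfNew(String link, String title) {
            if (byLink.containsKey(link)) {
                return false; // same link crawled on an earlier run: skip it
            }
            byLink.put(link, title);
            return true;
        }

        int size() {
            return byLink.size();
        }
    }

    public static void main(String[] args) {
        NewsStore store = new NewsStore();
        System.out.println(store.saveIfNew("http://www.jianshu.com/p/a", "Post A"));
        System.out.println(store.saveIfNew("http://www.jianshu.com/p/b", "Post B"));
        // A second crawl sees the same article again: the duplicate is rejected
        System.out.println(store.saveIfNew("http://www.jianshu.com/p/a", "Post A"));
        System.out.println(store.size());
    }
}
```

In the real pipeline the same test is expressed as a JPA Specification so the database, rather than process memory, is the source of truth across restarts.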
2.4 Scheduled Task Module (Scheduled)
Spring Boot's built-in scheduling annotation @Scheduled(cron = "0 0 0/2 * * ?") runs the crawl every two hours, starting from midnight (the six cron fields are second, minute, hour, day of month, month, and day of week; 0/2 in the hour field means "starting at hour 0, every 2 hours"). The scheduled method assembles a Spider from the WebMagic crawl module (Processor) and the storage pipeline. The code is as follows:
    package com.shang.spray.common.scheduled;

    import com.shang.spray.common.processor.JianShuProcessor;
    import com.shang.spray.pipeline.NewsPipeline;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;
    import us.codecraft.webmagic.Spider;

    /**
     * Info: News scheduled tasks
     * Created by Shang on 16/8/22.
     */
    @Component
    public class NewsScheduled {

        @Autowired
        private NewsPipeline newsPipeline;

        /**
         * Jianshu
         */
        @Scheduled(cron = "0 0 0/2 * * ?") // starting from 0 o'clock, every 2 hours
        public void jianshuScheduled() {
            System.out.println("---- starting the Jianshu scheduled task");
            Spider spider = Spider.create(new JianShuProcessor());
            spider.addUrl("http://www.jianshu.com");
            spider.addPipeline(newsPipeline);
            spider.thread(5);
            spider.setExitWhenComplete(true);
            spider.start();
            spider.stop();
        }
    }
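To make the cron expression concrete: in "0 0 0/2 * * ?" the hour field "0/2" is a start/step pair. A toy expansion of that one field (an illustration only, not Spring's actual cron parser) shows exactly which hours the task fires at:

```java
import java.util.ArrayList;
import java.util.List;

public class CronHourDemo {
    /** Toy expansion of a "start/step" cron field over the hours 0..23 (illustration only). */
    static List<Integer> expandHourField(String field) {
        List<Integer> hours = new ArrayList<>();
        String[] parts = field.split("/");
        int start = Integer.parseInt(parts[0]);
        // With no "/step" part, treat the field as a single fixed hour
        int step = parts.length > 1 ? Integer.parseInt(parts[1]) : 24;
        for (int h = start; h < 24; h += step) {
            hours.add(h);
        }
        return hours;
    }

    public static void main(String[] args) {
        // Hour field of "0 0 0/2 * * ?": fires at 00:00, 02:00, 04:00, ...
        System.out.println(expandHourField("0/2"));
    }
}
```

So the crawler runs twelve times a day, on the even hours.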
2.5 Enabling Scheduled Tasks in Spring Boot
Add the @EnableScheduling annotation to the Spring Boot application class to enable scheduled tasks. The code is as follows:
    package com.shang.spray;

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.boot.builder.SpringApplicationBuilder;
    import org.springframework.boot.context.web.SpringBootServletInitializer;
    import org.springframework.context.annotation.ComponentScan;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.scheduling.annotation.EnableScheduling;

    /**
     * Info:
     * Created by Shang on 16/7/8.
     */
    @Configuration
    @EnableAutoConfiguration
    @ComponentScan
    @SpringBootApplication
    @EnableScheduling
    public class SprayApplication extends SpringBootServletInitializer {

        @Override
        protected SpringApplicationBuilder configure(SpringApplicationBuilder application) {
            return application.sources(SprayApplication.class);
        }

        public static void main(String[] args) throws Exception {
            SpringApplication.run(SprayApplication.class, args);
        }
    }
3. Concluding remarks
WebMagic is the crawler framework I used to crawl site data for my Splash project. After looking at a few other crawler frameworks, I chose this one: it is simple, easy to learn, and powerful. I use only its basic features here; many of its more powerful capabilities go untouched. If you are interested, go read the official documentation!