Original address: http://www.jianshu.com/p/c3fc3129407d
1. Crawler Framework WebMagic
WebMagic is a simple and flexible crawler framework. Based on WebMagic, you can quickly develop an efficient and maintainable crawler.
1.1 Official Address
The official documentation is written quite clearly; it is recommended that you read it directly, though you can also read the content below. The addresses are as follows:
Official website: http://webmagic.io
Chinese documentation: http://webmagic.io/docs/zh/
English documentation: http://webmagic.io/docs/en
2. Integrating WebMagic with Spring Boot
Combining Spring Boot with WebMagic involves three main modules: the crawl module (Processor), which scrapes page data; the storage module (Pipeline), which writes the crawled data to the database; and the scheduled task module (Scheduled), which re-crawls the site's data at fixed intervals.
2.1 Maven Dependencies
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.5.3</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.5.3</version>
    </dependency>
2.2 Crawl Module (Processor)
The processor below crawls the Jianshu homepage: it analyzes the homepage's markup, extracts each article's link and title, and puts them into the WebMagic Page so that the storage module can pick them up and save them to the database. The code is as follows:
    package com.shang.spray.common.processor;

    import com.shang.spray.entity.News;
    import com.shang.spray.entity.Sources;
    import com.shang.spray.pipeline.NewsPipeline;
    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.processor.PageProcessor;
    import us.codecraft.webmagic.selector.Selectable;

    import java.util.List;

    /**
     * Info: Jianshu homepage crawler
     * Created by Shang on 16/9/9.
     */
    public class JianShuProcessor implements PageProcessor {

        private Site site = Site.me()
                .setDomain("jianshu.com")
                .setSleepTime(100) // request interval in ms (the value was lost in the original formatting; 100 is assumed)
                .setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36");

        public static final String list = "http://www.jianshu.com";

        @Override
        public void process(Page page) {
            if (page.getUrl().regex(list).match()) {
                List<Selectable> list = page.getHtml().xpath("//ul[@class='article-list thumbnails']/li").nodes();
                for (Selectable s : list) {
                    String title = s.xpath("//div/h4/a/text()").toString();
                    String link = s.xpath("//div/h4").links().toString();
                    News news = new News();
                    news.setTitle(title);
                    news.setInfo(title);
                    news.setLink(link);
                    news.setSources(new Sources(5));
                    page.putField("news" + title, news);
                }
            }
        }

        @Override
        public Site getSite() {
            return site;
        }

        public static void main(String[] args) {
            Spider spider = Spider.create(new JianShuProcessor());
            spider.addUrl("http://www.jianshu.com");
            spider.addPipeline(new NewsPipeline());
            spider.thread(5);
            spider.setExitWhenComplete(true);
            spider.start();
        }
    }
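The XPath selectors do the actual extraction in the processor above. As a standalone illustration of the same selector pattern (each li node, then the title text and the link inside it), here is a sketch using the JDK's built-in javax.xml.xpath API on a simplified, hypothetical HTML fragment; the markup below is an assumption for demonstration, not Jianshu's real page structure:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class XPathDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical, simplified stand-in for the article-list markup the processor targets
        String html = "<ul class=\"article-list\">"
                + "<li><div><h4><a href=\"/p/1\">First post</a></h4></div></li>"
                + "<li><div><h4><a href=\"/p/2\">Second post</a></h4></div></li>"
                + "</ul>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Select every <li> in the list, as the processor does with WebMagic's nodes()
        NodeList items = (NodeList) xpath.evaluate("//ul[@class='article-list']/li", doc, XPathConstants.NODESET);
        for (int i = 0; i < items.getLength(); i++) {
            // Relative expressions against each item, mirroring s.xpath(...) in the processor
            String title = xpath.evaluate("div/h4/a/text()", items.item(i));
            String link = xpath.evaluate("div/h4/a/@href", items.item(i));
            System.out.println(title + " -> " + link);
        }
    }
}
```

WebMagic's own selectors add conveniences such as links() on top of this idea, but the node-then-relative-expression shape is the same.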
2.3 Storage Module (Pipeline)
The storage module combines WebMagic with Spring Boot's repository layer: it implements WebMagic's Pipeline interface, retrieves the crawl module's data in the process method, and then calls the Spring Data save method. The code is as follows:
    package com.shang.spray.pipeline;

    import com.shang.spray.entity.News;
    import com.shang.spray.repository.NewsRepository;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.data.jpa.domain.Specification;
    import org.springframework.stereotype.Repository;
    import us.codecraft.webmagic.ResultItems;
    import us.codecraft.webmagic.Task;
    import us.codecraft.webmagic.pipeline.Pipeline;

    import javax.persistence.criteria.CriteriaBuilder;
    import javax.persistence.criteria.CriteriaQuery;
    import javax.persistence.criteria.Predicate;
    import javax.persistence.criteria.Root;
    import java.util.Date;
    import java.util.Map;

    /**
     * Info: News pipeline
     * Created by Shang on 16/8/22.
     */
    @Repository
    public class NewsPipeline implements Pipeline {

        @Autowired
        protected NewsRepository newsRepository;

        @Override
        public void process(ResultItems resultItems, Task task) {
            for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
                if (entry.getKey().contains("news")) {
                    News news = (News) entry.getValue();
                    Specification<News> specification = new Specification<News>() {
                        @Override
                        public Predicate toPredicate(Root<News> root, CriteriaQuery<?> criteriaQuery, CriteriaBuilder criteriaBuilder) {
                            return criteriaBuilder.and(criteriaBuilder.equal(root.get("link"), news.getLink()));
                        }
                    };
                    if (newsRepository.findOne(specification) == null) { // check whether the link already exists
                        news.setAuthor("Splash");
                        news.setTypeId(1);
                        news.setSort(1);
                        news.setStatus(1);
                        news.setExplicitLink(true);
                        news.setCreateDate(new Date());
                        news.setModifyDate(new Date());
                        newsRepository.save(news);
                    }
                }
            }
        }
    }
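The key design choice in this pipeline is the duplicate check: a News row is only saved when no existing row already has the same link, so repeated crawls of the same page do not produce duplicate records. Stripped of JPA, the logic reduces to a membership test on the link. A minimal, hypothetical in-memory sketch of that pattern (the NewsStore class below is an illustration, not part of the project):

```java
import java.util.HashMap;
import java.util.Map;

public class DedupDemo {
    // Hypothetical in-memory stand-in for the repository's find-by-link check
    static class NewsStore {
        private final Map<String, String> byLink = new HashMap<>(); // link -> title

        /** Saves only if this link has not been seen before; returns true if saved. */
        boolean saveIfNew(String link, String title) {
            if (byLink.containsKey(link)) {
                return false; // same link crawled on an earlier run: skip it
            }
            byLink.put(link, title);
            return true;
        }

        int size() {
            return byLink.size();
        }
    }

    public static void main(String[] args) {
        NewsStore store = new NewsStore();
        System.out.println(store.saveIfNew("http://www.jianshu.com/p/a", "Post A"));
        System.out.println(store.saveIfNew("http://www.jianshu.com/p/b", "Post B"));
        // A second crawl sees the same article again: the duplicate is rejected
        System.out.println(store.saveIfNew("http://www.jianshu.com/p/a", "Post A"));
        System.out.println(store.size());
    }
}
```

In the real pipeline the same test is expressed as a JPA Specification so the database, rather than process memory, is the source of truth across restarts.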
2.4 Scheduled Task Module (Scheduled)
Spring Boot's built-in scheduling annotation @Scheduled(cron = "0 0 0/2 * * ?") runs the crawl every two hours, starting from midnight (the six cron fields are second, minute, hour, day of month, month, and day of week; 0/2 in the hour field means "starting at hour 0, every 2 hours"). The scheduled method assembles a Spider from the WebMagic crawl module (Processor) and the storage pipeline. The code is as follows:
    package com.shang.spray.common.scheduled;

    import com.shang.spray.common.processor.JianShuProcessor;
    import com.shang.spray.pipeline.NewsPipeline;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;
    import us.codecraft.webmagic.Spider;

    /**
     * Info: News scheduled tasks
     * Created by Shang on 16/8/22.
     */
    @Component
    public class NewsScheduled {

        @Autowired
        private NewsPipeline newsPipeline;

        /**
         * Jianshu
         */
        @Scheduled(cron = "0 0 0/2 * * ?") // starting from 0 o'clock, every 2 hours
        public void jianshuScheduled() {
            System.out.println("---- starting the Jianshu scheduled task");
            Spider spider = Spider.create(new JianShuProcessor());
            spider.addUrl("http://www.jianshu.com");
            spider.addPipeline(newsPipeline);
            spider.thread(5);
            spider.setExitWhenComplete(true);
            spider.start();
            spider.stop();
        }
    }
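To make the cron expression concrete: in "0 0 0/2 * * ?" the hour field "0/2" is a start/step pair. A toy expansion of that one field (an illustration only, not Spring's actual cron parser) shows exactly which hours the task fires at:

```java
import java.util.ArrayList;
import java.util.List;

public class CronHourDemo {
    /** Toy expansion of a "start/step" cron field over the hours 0..23 (illustration only). */
    static List<Integer> expandHourField(String field) {
        List<Integer> hours = new ArrayList<>();
        String[] parts = field.split("/");
        int start = Integer.parseInt(parts[0]);
        // With no "/step" part, treat the field as a single fixed hour
        int step = parts.length > 1 ? Integer.parseInt(parts[1]) : 24;
        for (int h = start; h < 24; h += step) {
            hours.add(h);
        }
        return hours;
    }

    public static void main(String[] args) {
        // Hour field of "0 0 0/2 * * ?": fires at 00:00, 02:00, 04:00, ...
        System.out.println(expandHourField("0/2"));
    }
}
```

So the crawler runs twelve times a day, on the even hours.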
2.5 Enabling Scheduled Tasks in Spring Boot
Add the @EnableScheduling annotation to the Spring Boot application class to enable scheduled tasks. The code is as follows:
    package com.shang.spray;

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.boot.builder.SpringApplicationBuilder;
    import org.springframework.boot.context.web.SpringBootServletInitializer;
    import org.springframework.context.annotation.ComponentScan;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.scheduling.annotation.EnableScheduling;

    /**
     * Info:
     * Created by Shang on 16/7/8.
     */
    @Configuration
    @EnableAutoConfiguration
    @ComponentScan
    @SpringBootApplication
    @EnableScheduling
    public class SprayApplication extends SpringBootServletInitializer {

        @Override
        protected SpringApplicationBuilder configure(SpringApplicationBuilder application) {
            return application.sources(SprayApplication.class);
        }

        public static void main(String[] args) throws Exception {
            SpringApplication.run(SprayApplication.class, args);
        }
    }
3. Concluding remarks
WebMagic is the crawler framework I used to crawl site data for my Splash project. After looking at a few other crawler frameworks, I chose this one: it is simple, easy to learn, and powerful. I use only its basic features here; many of its more powerful capabilities go untouched. If you are interested, go read the official documentation!