Using the WebMagic crawler framework with Spring Boot

Original address: http://www.jianshu.com/p/c3fc3129407d

1. The WebMagic Crawler Framework

WebMagic is a simple and flexible crawler framework. Based on WebMagic, you can quickly develop an efficient and maintainable crawler.

1.1 Official Address

The official documentation is clearly written, so it is recommended that you read it directly; you can also read the content below. The addresses are as follows:

Official website: http://webmagic.io

Chinese documentation: http://webmagic.io/docs/zh/

English documentation: http://webmagic.io/docs/en

2. Integrating WebMagic with Spring Boot

Combining Spring Boot and WebMagic mainly involves three modules: the crawl module (Processor), the storage module (Pipeline), which writes the crawled data to the database, and the scheduled task module (Scheduled), which crawls the site's data at fixed intervals.

2.1 Adding the Maven Dependencies
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.5.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.5.3</version>
</dependency>
2.2 Crawl Module Processor

The processor crawls the Jianshu homepage, parses its page data, extracts each article's link and title, and puts them into WebMagic's Page object, from which the storage module retrieves them and saves them to the database. The code is as follows:

package com.shang.spray.common.processor;

import com.shang.spray.entity.News;
import com.shang.spray.entity.Sources;
import com.shang.spray.pipeline.NewsPipeline;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

import java.util.List;

/**
 * Info: Jianshu homepage crawler
 * Created by Shang on 16/9/9.
 */
public class JianShuProcessor implements PageProcessor {

    private Site site = Site.me()
            .setDomain("jianshu.com")
            .setSleepTime(100)
            .setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36");

    public static final String list = "http://www.jianshu.com";

    @Override
    public void process(Page page) {
        if (page.getUrl().regex(list).match()) {
            List<Selectable> nodes = page.getHtml().xpath("//ul[@class='article-list thumbnails']/li").nodes();
            for (Selectable s : nodes) {
                String title = s.xpath("//div/h4/a/text()").toString();
                String link = s.xpath("//div/h4").links().toString();
                News news = new News();
                news.setTitle(title);
                news.setInfo(title);
                news.setLink(link);
                news.setSources(new Sources(5));
                page.putField("news" + title, news);
            }
        }
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider spider = Spider.create(new JianShuProcessor());
        spider.addUrl("http://www.jianshu.com");
        spider.addPipeline(new NewsPipeline());
        spider.thread(5);
        spider.setExitWhenComplete(true);
        spider.start();
    }
}
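The XPath pattern used above (select the list items, then query each one relatively) can be tried outside WebMagic. The following self-contained sketch uses the JDK's built-in XPath engine on a small hand-written, well-formed fragment; the markup and values are made up for illustration, not taken from jianshu.com. Note one difference: WebMagic resolves a leading `//` in `s.xpath(...)` relative to the current node, while the JDK engine would search the whole document, so plain relative paths are used here.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class XPathDemo {

    // Parse a well-formed fragment and return "title -> link" pairs,
    // selected with the same pattern the processor uses.
    static List<String> extract(String html) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Select every list item of the article list.
        NodeList items = (NodeList) xpath.evaluate(
                "//ul[@class='article-list thumbnails']/li", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            // Relative expressions evaluated against each <li> node.
            String title = xpath.evaluate("div/h4/a/text()", items.item(i));
            String link = xpath.evaluate("div/h4/a/@href", items.item(i));
            result.add(title + " -> " + link);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String html = "<ul class='article-list thumbnails'>"
                + "<li><div><h4><a href='/p/abc'>First post</a></h4></div></li>"
                + "<li><div><h4><a href='/p/def'>Second post</a></h4></div></li>"
                + "</ul>";
        System.out.println(extract(html)); // [First post -> /p/abc, Second post -> /p/def]
    }
}
```

The fragment here is valid XML, which the JDK parser requires; real pages are usually messier, which is one reason WebMagic ships its own HTML-tolerant selector.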
2.3 Storage Module Pipeline

The storage module combines WebMagic with Spring Boot's repository module: it implements WebMagic's Pipeline interface, retrieves the data produced by the crawl module in the process method, and then calls Spring Boot's save method. The code is as follows:

package com.shang.spray.pipeline;

import com.shang.spray.entity.News;
import com.shang.spray.entity.Sources;
import com.shang.spray.repository.NewsRepository;
import org.apache.commons.lang3.StringUtils;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.jpa.domain.Specification;
import org.springframework.stereotype.Repository;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import javax.persistence.criteria.CriteriaBuilder;
import javax.persistence.criteria.CriteriaQuery;
import javax.persistence.criteria.Predicate;
import javax.persistence.criteria.Root;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Map;

/**
 * Info: News
 * Created by Shang on 16/8/22.
 */
@Repository
public class NewsPipeline implements Pipeline {

    @Autowired
    protected NewsRepository newsRepository;

    @Override
    public void process(ResultItems resultItems, Task task) {
        for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
            if (entry.getKey().contains("news")) {
                final News news = (News) entry.getValue();
                Specification<News> specification = new Specification<News>() {
                    @Override
                    public Predicate toPredicate(Root<News> root, CriteriaQuery<?> criteriaQuery, CriteriaBuilder criteriaBuilder) {
                        return criteriaBuilder.and(criteriaBuilder.equal(root.get("link"), news.getLink()));
                    }
                };
                if (newsRepository.findOne(specification) == null) { // check whether the link already exists
                    news.setAuthor("Splash");
                    news.setTypeId(1);
                    news.setSort(1);
                    news.setStatus(1);
                    news.setExplicitLink(true);
                    news.setCreateDate(new Date());
                    news.setModifyDate(new Date());
                    newsRepository.save(news);
                }
            }
        }
    }
}
2.4 Scheduled Task Module Scheduled

Using Spring Boot's built-in scheduled task annotation @Scheduled(cron = "0 0 0/2 * * ?"), a crawl task is executed every two hours, starting at midnight each day; the scheduled task invokes the WebMagic crawl module (Processor). The code is as follows:

package com.shang.spray.common.scheduled;

import com.shang.spray.common.processor.DevelopersProcessor;
import com.shang.spray.common.processor.JianShuProcessor;
import com.shang.spray.common.processor.ZhiHuProcessor;
import com.shang.spray.entity.Config;
import com.shang.spray.pipeline.NewsPipeline;
import com.shang.spray.service.ConfigService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.jpa.domain.Specification;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Spider;

import javax.persistence.criteria.CriteriaBuilder;
import javax.persistence.criteria.CriteriaQuery;
import javax.persistence.criteria.Predicate;
import javax.persistence.criteria.Root;

/**
 * Info: News timer task
 * Created by Shang on 16/8/22.
 */
@Component
public class NewsScheduled {

    @Autowired
    private NewsPipeline newsPipeline;

    /**
     * Jianshu
     */
    @Scheduled(cron = "0 0 0/2 * * ?") // starting from 0 o'clock, every 2 hours
    public void jianshuScheduled() {
        System.out.println("----start the Jianshu scheduled task");
        Spider spider = Spider.create(new JianShuProcessor());
        spider.addUrl("http://www.jianshu.com");
        spider.addPipeline(newsPipeline);
        spider.thread(5);
        spider.setExitWhenComplete(true);
        spider.start();
        spider.stop();
    }
}
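Reading the cron expression field by field: "0 0 0/2 * * ?" means second 0, minute 0, every second hour starting from hour 0, on every day. Independent of Spring, a small stdlib sketch can enumerate the fire times this schedule produces in one day (this only illustrates the hour stepping, it is not a cron parser):

```java
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.List;

public class CronIllustration {

    // Hours matched by the "0/2" hour field: 0, 2, 4, ..., 22.
    static List<LocalTime> dailyFireTimes() {
        List<LocalTime> times = new ArrayList<>();
        for (int hour = 0; hour < 24; hour += 2) {
            times.add(LocalTime.of(hour, 0, 0)); // minute and second fields are both 0
        }
        return times;
    }

    public static void main(String[] args) {
        // 12 runs per day: 00:00, 02:00, ..., 22:00
        System.out.println(dailyFireTimes());
    }
}
```

So the crawler above runs 12 times a day, which is why the method both starts the spider and stops it with setExitWhenComplete(true) before the next trigger fires.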
2.5 Enabling Scheduled Tasks in Spring Boot

Enable scheduled tasks by adding the @EnableScheduling annotation to the Spring Boot application class. The code is as follows:

package com.shang.spray;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.builder.SpringApplicationBuilder;
import org.springframework.boot.context.web.SpringBootServletInitializer;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableScheduling;

/**
 * Info:
 * Created by Shang on 16/7/8.
 */
@Configuration
@EnableAutoConfiguration
@ComponentScan
@SpringBootApplication
@EnableScheduling
public class SprayApplication extends SpringBootServletInitializer {

    @Override
    protected SpringApplicationBuilder configure(SpringApplicationBuilder application) {
        return application.sources(SprayApplication.class);
    }

    public static void main(String[] args) throws Exception {
        SpringApplication.run(SprayApplication.class, args);
    }
}
3. Concluding remarks

WebMagic is the crawler framework I used to crawl site data in the Spray project. After trying a few other crawler frameworks, I chose this one because it is simple, easy to learn, and powerful. I only use its basic features here; many of its more powerful features go unused. If you are interested, have a look at the official documentation!
