In many cases, when we crawl a site with WebMagic, we want to store the crawled data in MySQL or Redis, so WebMagic needs to be extended with a custom Pipeline. First, let's look at WebMagic's four basic components.
First, the four WebMagic components
1. Downloader
The Downloader is responsible for downloading pages from the Internet for subsequent processing. WebMagic uses Apache HttpClient as its default download tool.
2. PageProcessor
The PageProcessor is responsible for parsing pages, extracting useful information, and discovering new links. WebMagic uses Jsoup as its HTML parsing tool and, on top of it, has developed Xsoup, a tool for parsing XPath.
Of the four components, the PageProcessor is different for every page of every site and is the part that users write themselves.
3. Scheduler
The Scheduler is responsible for managing the URLs to be crawled, as well as deduplication. By default WebMagic provides a JDK in-memory queue to manage URLs and uses a set for deduplication; distributed URL management backed by Redis is also supported.
Unless the project has special distributed requirements, you do not need to customize the Scheduler yourself.
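If a project does need distributed crawling, webmagic-extension ships a RedisScheduler that can be plugged into the Spider when it is built. A minimal sketch, assuming a local Redis instance and using the proxy crawler defined later in this article as the PageProcessor (the start URL is only a placeholder):

import com.mdd.pip.crawler.XiciProxyIpCrawler;

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.scheduler.RedisScheduler;

public class DistributedCrawlDemo {
    public static void main(String[] args) {
        // RedisScheduler keeps the URL queue and the duplicate-check set in Redis,
        // so several crawler instances can share the same crawl task.
        Spider.create(new XiciProxyIpCrawler())
                .setScheduler(new RedisScheduler("127.0.0.1")) // assumed local Redis host
                .addUrl("http://example.com/start")            // placeholder start URL
                .run();
    }
}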
4. Pipeline
The Pipeline is responsible for processing the extracted results, including computation and persistence to files, databases, and so on. By default WebMagic provides two result-handling options: printing to the console and saving to a file.
The Pipeline defines how results are saved: if you want to save to a particular database, you need to write the corresponding Pipeline. For one class of requirements you only need to write one Pipeline.
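For reference, wiring the two built-in result handlers into a Spider looks roughly like this (the output directory and start URL are placeholders):

import com.mdd.pip.crawler.XiciProxyIpCrawler;

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.pipeline.FilePipeline;

public class BuiltInPipelineDemo {
    public static void main(String[] args) {
        Spider.create(new XiciProxyIpCrawler())                   // crawler defined later in this article
                .addPipeline(new ConsolePipeline())               // print ResultItems to the console
                .addPipeline(new FilePipeline("/data/webmagic/")) // placeholder directory for text output
                .addUrl("http://example.com/start")               // placeholder start URL
                .run();
    }
}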
Second, the custom Pipeline: implement the Pipeline interface and override the process method.
package com.mdd.pip.pipeline;

import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import com.mdd.pip.model.ProxyIp;
import com.mdd.pip.service.ProxyIpService;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

@Component
public class DataPipeline implements Pipeline {

    @Autowired
    private ProxyIpService proxyIpService;

    /**
     * MySQL storage
     */
    /*
    public void process(ResultItems resultItems, Task task) {
        List<ProxyIp> proxyIpList = resultItems.get("proxyIpList");
        if (proxyIpList != null && !proxyIpList.isEmpty()) {
            proxyIpService.saveProxyIpList(proxyIpList);
        }
    }
    */

    /**
     * Redis storage
     */
    public void process(ResultItems resultItems, Task task) {
        List<ProxyIp> proxyIpList = resultItems.get("proxyIpList");
        if (proxyIpList != null && !proxyIpList.isEmpty()) {
            proxyIpService.saveProxyListIpInRedis(proxyIpList);
        }
    }
}
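The ProxyIpService called by the pipeline is not shown in the original post. Purely as an illustration of what the Redis branch might look like, here is a hypothetical sketch based on Spring Data Redis; the key name, the RedisTemplate configuration, and the serialization details are all assumptions, not the author's actual code:

package com.mdd.pip.service;

import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

import com.mdd.pip.model.ProxyIp;

// Hypothetical service: pushes crawled proxies onto a Redis list.
// The key "proxy_ip_list" and the RedisTemplate setup are assumptions.
@Service
public class ProxyIpService {

    @Autowired
    private RedisTemplate<String, ProxyIp> redisTemplate;

    // Redis storage used by the pipeline above
    public void saveProxyListIpInRedis(List<ProxyIp> proxyIpList) {
        for (ProxyIp proxyIp : proxyIpList) {
            redisTemplate.opsForList().rightPush("proxy_ip_list", proxyIp);
        }
    }

    // a corresponding saveProxyIpList(...) writing to MySQL would live here as well
}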
The ResultItems object is essentially a map, so to save data we just need to encapsulate the crawled data in an object at crawl time and put it into ResultItems. If there is a lot of data, consider collecting it in a List.
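The ProxyIp model is not included in the original post either; a plain POJO along the following lines, with field names inferred from the setters used in the crawler below, would fit (the actual class may differ):

package com.mdd.pip.model;

import java.io.Serializable;

// Hypothetical data object inferred from the setters used in the crawler;
// Serializable so it can be stored via the default RedisTemplate serializer.
public class ProxyIp implements Serializable {

    private String proxyIp;
    private int proxyPort;
    private String ipAddress;
    private String anonymity;
    private String proxyType;
    private String aliveTime;

    public String getProxyIp() { return proxyIp; }
    public void setProxyIp(String proxyIp) { this.proxyIp = proxyIp; }
    public int getProxyPort() { return proxyPort; }
    public void setProxyPort(int proxyPort) { this.proxyPort = proxyPort; }
    public String getIpAddress() { return ipAddress; }
    public void setIpAddress(String ipAddress) { this.ipAddress = ipAddress; }
    public String getAnonymity() { return anonymity; }
    public void setAnonymity(String anonymity) { this.anonymity = anonymity; }
    public String getProxyType() { return proxyType; }
    public void setProxyType(String proxyType) { this.proxyType = proxyType; }
    public String getAliveTime() { return aliveTime; }
    public void setAliveTime(String aliveTime) { this.aliveTime = aliveTime; }
}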
package com.mdd.pip.crawler;

import java.util.ArrayList;
import java.util.List;

import org.apache.log4j.Logger;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Component;

import com.mdd.pip.model.ProxyIp;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

/**
 * Xici proxy site IP crawler
 *
 * @author XWL 2017.6.3
 */
@Component
public class XiciProxyIpCrawler implements PageProcessor {

    private Logger logger = Logger.getLogger(XiciProxyIpCrawler.class);

    // Part 1: configuration for crawling the site, including encoding, crawl interval, retry count, etc.
    private Site site = Site.me()
            .setCycleRetryTimes(3)
            .setRetryTimes(3)
            .setSleepTime(1000)
            .setUserAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0");

    public Site getSite() {
        return site;
    }

    public void process(Page page) {
        Document html = page.getHtml().getDocument();
        // result set
        List<ProxyIp> proxyIpList = new ArrayList<ProxyIp>();
        Elements trElements = html.getElementById("ip_list").getElementsByTag("tr");
        for (Element trEle : trElements) {
            Elements tdElements = trEle.getElementsByTag("td");
            if (tdElements == null || tdElements.size() <= 0) {
                continue;
            }
            try {
                ProxyIp proxyIp = new ProxyIp();
                String ip = tdElements.get(1).text();
                String proxyPort = tdElements.get(2).text();
                String ipAddress = tdElements.get(3).text();
                String anonymity = tdElements.get(4).text();
                String proxyType = tdElements.get(5).text();
                String aliveTime = tdElements.get(6).text();
                proxyIp.setProxyIp(ip);
                proxyIp.setProxyPort(Integer.parseInt(proxyPort));
                proxyIp.setAliveTime(aliveTime);
                proxyIp.setAnonymity(anonymity);
                proxyIp.setIpAddress(ipAddress);
                proxyIp.setProxyType(proxyType);
                logger.info(proxyIp.getProxyIp() + ":" + proxyIp.getProxyPort());
                proxyIpList.add(proxyIp);
            } catch (Exception e) {
                logger.error("IP proxy parsing error!", e);
            }
        }
        page.putField("proxyIpList", proxyIpList);
    }
}
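To tie the pieces together, the crawler and the custom pipeline are registered on a Spider. A possible entry point is sketched below; in the original project both beans are Spring components and would come from the application context, and the start URL and thread count here are placeholders, not taken from the post:

import com.mdd.pip.crawler.XiciProxyIpCrawler;
import com.mdd.pip.pipeline.DataPipeline;

import us.codecraft.webmagic.Spider;

public class ProxyIpSpiderStarter {

    // crawler and dataPipeline are expected to be Spring-managed beans,
    // so that the pipeline's ProxyIpService is injected correctly.
    public static void start(XiciProxyIpCrawler crawler, DataPipeline dataPipeline) {
        Spider.create(crawler)
                .addPipeline(dataPipeline)                 // route ResultItems into MySQL/Redis
                .addUrl("http://example.com/proxy-list")   // placeholder start URL
                .thread(5)                                 // placeholder thread count
                .run();
    }
}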
page.putField("proxyIpList", proxyIpList) essentially puts a value into the ResultItems object held by the Page; that is the hand-off point our custom Pipeline reads from.
This pluggable, customizable design is worth using as a reference, and it is worth looking at the source code next; a simplified view of the relevant part is sketched below.
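For reference, the relevant part of the Page class in the WebMagic source looks roughly like this (simplified here; see the project repository for the exact code):

public class Page {

    private ResultItems resultItems = new ResultItems();

    // putField simply delegates to the ResultItems map that is later
    // handed to every registered Pipeline.
    public void putField(String key, Object field) {
        resultItems.put(key, field);
    }

    public ResultItems getResultItems() {
        return resultItems;
    }
}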
Reference:
http://webmagic.io/docs/zh/posts/ch1-overview/architecture.html
WebMagic Custom Storage (MySQL, Redis storage)