WebMagic Custom Storage (MySQL, Redis storage)


In many cases, when we crawl a site with WebMagic, we want the crawled data stored in MySQL or Redis, so WebMagic needs to be extended with a custom Pipeline. First, let's look at WebMagic's four basic components.

First, the four components of WebMagic

1. Downloader

Downloader is responsible for downloading pages from the Internet for subsequent processing. WebMagic uses Apache HttpClient as the default download tool.
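
For illustration, here is a minimal sketch (not from the original article) of where the Downloader plugs in, using the GithubRepoPageProcessor example that ships with webmagic-core:

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.processor.example.GithubRepoPageProcessor;

public class DownloaderWiring {
    public static void main(String[] args) {
        // HttpClientDownloader is WebMagic's default, so setting it explicitly
        // changes nothing here; a custom Downloader would plug in the same way.
        Spider.create(new GithubRepoPageProcessor())
              .setDownloader(new HttpClientDownloader())
              .addUrl("https://github.com/code4craft") // the example's start URL
              .run();
    }
}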

2. PageProcessor

PageProcessor is responsible for parsing the page, extracting useful information, and discovering new links. WebMagic uses Jsoup as its HTML parsing tool and, on top of it, develops Xsoup, a tool for parsing XPath.

Of these four components, PageProcessor differs for every page of every site; it is the part the user customizes.
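
To make this concrete, here is a minimal PageProcessor sketch (illustrative only; the class name, XPath expression, and link filter are invented for the example):

import java.util.List;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class DocsPageProcessor implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    public void process(Page page) {
        // Extract with XPath, which WebMagic hands to Xsoup; css() would use Jsoup selectors.
        page.putField("title", page.getHtml().xpath("//title/text()").toString());
        // Discover new links and queue them for the Scheduler.
        List<String> links = page.getHtml().links().regex(".*webmagic\\.io.*").all();
        page.addTargetRequests(links);
    }

    public Site getSite() {
        return site;
    }
}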

3. Scheduler

Scheduler is responsible for managing the URLs to be crawled, as well as for deduplicating them. By default, WebMagic provides a JDK in-memory queue to manage the URLs and uses a collection (a set) for deduplication. Distributed management with Redis is also supported.

Unless the project has special distributed requirements, you do not need to customize the Scheduler yourself.
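
For the distributed case, a hedged sketch using RedisScheduler from the webmagic-extension module (the Redis host is a placeholder, and DocsPageProcessor is the sketch from the PageProcessor section above):

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.scheduler.RedisScheduler;

public class DistributedCrawler {
    public static void main(String[] args) {
        // RedisScheduler keeps both the URL queue and the duplicate-check set
        // in Redis, so several crawler instances can share a single frontier.
        Spider.create(new DocsPageProcessor())               // processor sketched above
              .setScheduler(new RedisScheduler("127.0.0.1")) // placeholder Redis host
              .addUrl("http://webmagic.io/docs/")
              .thread(5)
              .run();
    }
}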

4. Pipeline

Pipeline is responsible for processing the extracted results, including computation, persistence to files, databases, and so on. By default, WebMagic provides two result-handling implementations: output to the console and saving to a file.

Pipeline defines how results are saved. If you want to save to a specific database, you need to write the corresponding Pipeline; for a given class of requirements, you generally only need to write one Pipeline.
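
In use, the two built-in result handlers look like this (a minimal sketch; the output directory is a placeholder):

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.pipeline.FilePipeline;

public class BuiltInPipelines {
    public static void main(String[] args) {
        // ConsolePipeline prints each ResultItems entry to stdout;
        // FilePipeline persists results under the given directory.
        Spider.create(new DocsPageProcessor())                // processor sketched above
              .addPipeline(new ConsolePipeline())
              .addPipeline(new FilePipeline("/data/webmagic/")) // placeholder path
              .addUrl("http://webmagic.io/docs/")
              .run();
    }
}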

Second, the custom Pipeline: implement the Pipeline interface and override the process method.

package com.mdd.pip.pipeLine;

import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import com.mdd.pip.model.ProxyIp;
import com.mdd.pip.service.ProxyIpService;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

@Component
public class DataPipeline implements Pipeline {

    @Autowired
    private ProxyIpService proxyIpService;

    /**
     * MySQL storage
     */
    // public void process(ResultItems resultItems, Task task) {
    //     List<ProxyIp> proxyIpList = resultItems.get("proxyIpList");
    //     if (proxyIpList != null && !proxyIpList.isEmpty()) {
    //         proxyIpService.saveProxyIpList(proxyIpList);
    //     }
    // }

    /**
     * Redis storage
     */
    public void process(ResultItems resultItems, Task task) {
        // ResultItems carries whatever the PageProcessor stored via page.putField()
        List<ProxyIp> proxyIpList = resultItems.get("proxyIpList");
        if (proxyIpList != null && !proxyIpList.isEmpty()) {
            proxyIpService.saveProxyListIpInRedis(proxyIpList);
        }
    }
}
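
The article never shows ProxyIpService itself. Below is one plausible sketch, assuming Spring Data Redis for the Redis branch and a DAO/mapper for the MySQL branch; apart from the two method names, everything in it is an assumption:

package com.mdd.pip.service;

import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

import com.mdd.pip.model.ProxyIp;

// Hypothetical service backing DataPipeline. The method names match the
// calls above; the bodies are assumptions for illustration only.
@Service
public class ProxyIpService {

    @Autowired
    private RedisTemplate<String, ProxyIp> redisTemplate; // assumes Spring Data Redis

    // @Autowired
    // private ProxyIpDao proxyIpDao; // hypothetical DAO/mapper for MySQL

    /** Redis storage: push the whole batch onto a Redis list. */
    public void saveProxyListIpInRedis(List<ProxyIp> proxyIpList) {
        redisTemplate.opsForList().rightPushAll("proxyIpList", proxyIpList);
    }

    /** MySQL storage: delegate to a DAO/mapper (sketch only). */
    public void saveProxyIpList(List<ProxyIp> proxyIpList) {
        // proxyIpDao.batchInsert(proxyIpList); // hypothetical batch insert
    }
}
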
The ResultItems object is essentially a map. So to save an object, we just need to encapsulate the crawled data as an object at crawl time and store it in ResultItems. If there is a lot of data, consider storing it as a List, as the crawler below does.
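
For completeness, here is a sketch of the ProxyIp model implied by the setters used in the crawler code that follows (the field types are assumptions):

package com.mdd.pip.model;

import java.io.Serializable;

// Model implied by the crawler's setters; remaining getters follow the
// same pattern and are elided here.
public class ProxyIp implements Serializable {

    private String proxyIp;   // the proxy's IP
    private int proxyPort;    // the proxy's port
    private String ipAddress; // geographic location of the IP
    private String anonymity; // anonymity level
    private String proxyType; // HTTP, HTTPS, etc.
    private String aliveTime; // how long the proxy has been alive

    public String getProxyIp() { return proxyIp; }
    public void setProxyIp(String proxyIp) { this.proxyIp = proxyIp; }
    public int getProxyPort() { return proxyPort; }
    public void setProxyPort(int proxyPort) { this.proxyPort = proxyPort; }
    public void setIpAddress(String ipAddress) { this.ipAddress = ipAddress; }
    public void setAnonymity(String anonymity) { this.anonymity = anonymity; }
    public void setProxyType(String proxyType) { this.proxyType = proxyType; }
    public void setAliveTime(String aliveTime) { this.aliveTime = aliveTime; }
}
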
package com.mdd.pip.crawler;

import java.util.ArrayList;
import java.util.List;

import org.apache.log4j.Logger;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Component;

import com.mdd.pip.model.ProxyIp;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

/**
 * Xici proxy website IP crawler
 *
 * @author xwl 2017.6.3
 */
@Component
public class XiciProxyIpCrawler implements PageProcessor {

    private Logger logger = Logger.getLogger(XiciProxyIpCrawler.class);

    // Part one: the site's crawl configuration, including encoding, crawl interval, retry count, etc.
    private Site site = Site.me()
            .setCycleRetryTimes(3)
            .setRetryTimes(3)
            .setSleepTime(1000)
            .setUserAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0");

    public Site getSite() {
        return site;
    }

    public void process(Page page) {
        Document html = page.getHtml().getDocument();
        // Result set
        List<ProxyIp> proxyIpList = new ArrayList<ProxyIp>();
        Elements trElements = html.getElementById("ip_list").getElementsByTag("tr");
        for (Element trEle : trElements) {
            Elements tdElements = trEle.getElementsByTag("td");
            if (tdElements == null || tdElements.size() <= 0) {
                continue;
            }
            try {
                ProxyIp proxyIp = new ProxyIp();
                String ip = tdElements.get(1).text();
                String proxyPort = tdElements.get(2).text();
                String ipAddress = tdElements.get(3).text();
                String anonymity = tdElements.get(4).text();
                String proxyType = tdElements.get(5).text();
                String aliveTime = tdElements.get(6).text();
                proxyIp.setProxyIp(ip);
                proxyIp.setProxyPort(Integer.parseInt(proxyPort));
                proxyIp.setAliveTime(aliveTime);
                proxyIp.setAnonymity(anonymity);
                proxyIp.setIpAddress(ipAddress);
                proxyIp.setProxyType(proxyType);
                logger.info(proxyIp.getProxyIp() + ":" + proxyIp.getProxyPort());
                proxyIpList.add(proxyIp);
            } catch (Exception e) {
                logger.error("IP proxy parsing error!", e);
            }
        }
        page.putField("proxyIpList", proxyIpList);
    }
}

page.putField("proxyIpList", proxyIpList) essentially sets a value on the ResultItems object.
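
To tie the pieces together, here is a hedged sketch of how the processor and the custom pipeline might be assembled into a Spider (the bootstrap class and the start URL are assumptions, not the author's code):

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import com.mdd.pip.crawler.XiciProxyIpCrawler;
import com.mdd.pip.pipeLine.DataPipeline;

import us.codecraft.webmagic.Spider;

// Hypothetical bootstrap that wires the PageProcessor and Pipeline shown above.
@Component
public class ProxyIpCrawlerStarter {

    @Autowired
    private XiciProxyIpCrawler xiciProxyIpCrawler;

    @Autowired
    private DataPipeline dataPipeline;

    public void crawl() {
        Spider.create(xiciProxyIpCrawler)
              .addPipeline(dataPipeline)           // custom storage instead of console/file
              .addUrl("http://www.xicidaili.com/") // assumed start URL
              .thread(2)
              .run();
    }
}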

This kind of pluggable, customizable design is worth borrowing for our own projects. To go deeper, look at the source code next.

Reference:
http://webmagic.io/docs/zh/posts/ch1-overview/architecture.html
