Many enterprises need web crawlers to collect product information. The typical development model looks like this:
for i = 1; i <= max page number; i++:
    list page URL = item list URL + "?page=" + i   (page number)
    list page = crawl(list page URL)
    product link list = extract item links(list page)
    for link in product link list:
        product page = crawl(link)
        extract(product page)
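To make this model concrete, here is a minimal runnable sketch of the naive double loop, assuming Jsoup is used for fetching and parsing; the listing URL, page count, and CSS selectors are hypothetical placeholders, not taken from a real site:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NaiveProductCrawler {
    public static void main(String[] args) throws Exception {
        /* outer loop: iterate over the product list pages */
        for (int page = 1; page <= 3; page++) {
            String listUrl = "http://example.com/list?page=" + page;   /* placeholder URL */
            Document listPage = Jsoup.connect(listUrl).get();           /* crawl(list page URL) */
            /* inner loop: follow every product link found on the list page */
            for (Element link : listPage.select("a.product")) {         /* placeholder selector */
                Document productPage = Jsoup.connect(link.attr("abs:href")).get();  /* crawl(link) */
                System.out.println(productPage.select("h1.title").text());          /* extract(product page) */
            }
        }
    }
}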
Such a model looks simple, but it has several problems:
1) The crawler has no thread-pool support.
2) There is no breakpoint/resume mechanism.
3) There is no store of crawl state. When crawling commodity sites, the server often rejects requests (too many requests), so once a link is rejected some pages are never crawled. Without a record of crawl status, the crawler has to crawl everything again to get complete data.
4) The code is hard to read when the extraction logic is complex (there is no fixed framework).
To solve these problems, many enterprises still do not choose a crawler framework such as Nutch or crawler4j, because those crawlers are based on breadth-first traversal, while the business above is a simple double loop rather than a breadth-first traversal. In fact, the double loop can be transformed into breadth-first traversals: a breadth-first traversal with depth 1 is equivalent to crawling a URL list (the seed list), and each loop in the business above is just a crawl over a URL list. The double-loop pseudocode can therefore be split into two breadth-first traversals.
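To make the depth-1 idea concrete, here is a minimal sketch of a seed-list crawl. It reuses the WebCollector calls that the full example below relies on; the seed URLs and the crawl folder name are placeholders, and the sketch is an illustration rather than the article's program:

import java.io.IOException;
import cn.edu.hfut.dmic.webcollector.crawler.BreadthCrawler;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.util.Config;

/* Sketch: a depth-1 breadth-first crawl that simply visits a list of seed URLs. */
public class SeedListCrawler extends BreadthCrawler {

    @Override
    public void visit(Page page) {
        /* each seed is fetched exactly once; no discovered links are followed */
        System.out.println("visited: " + page.getUrl());
    }

    public static void main(String[] args) throws IOException {
        Config.topN = 0;                          /* do not expand links found on visited pages */
        SeedListCrawler crawler = new SeedListCrawler();
        crawler.setCrawlPath("crawl_demo");       /* folder that stores crawl state (placeholder name) */
        crawler.addSeed("http://example.com/page1");   /* placeholder seeds */
        crawler.addSeed("http://example.com/page2");
        crawler.addRegex(".*");
        crawler.setThreads(5);                    /* the framework's thread pool */
        crawler.setResumable(true);               /* breakpoint/resume support */
        crawler.start(1);                         /* depth 1: only the seed list is crawled */
    }
}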
We design two breadth-first crawlers, LinkCrawler and ProductCrawler:
1) LinkCrawler traverses the product list pages, extracts the URL of each product detail page, and injects the extracted URLs into ProductCrawler.
2) ProductCrawler uses the URLs injected by LinkCrawler as seeds, crawls them, and extracts each product detail page.
The following example uses the WebCollector crawler framework to collect group-buy information from Dianping:
import java.io.File;
import java.io.IOException;
import java.util.regex.Pattern;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import cn.edu.hfut.dmic.webcollector.crawler.BreadthCrawler;
import cn.edu.hfut.dmic.webcollector.generator.Injector;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.util.Config;
import cn.edu.hfut.dmic.webcollector.util.FileUtils;

/**
 * A crawler that collects Dianping group-buy information.
 * Unlike many extraction tasks, this one is not done with a single breadth-first
 * traversal but in two steps:
 * 1. Loop through the product list pages and extract the URL of each product detail page.
 * 2. Extract each product detail page.
 * Most crawler frameworks only support breadth-first traversal, so many people use a plain
 * loop for this kind of extraction and miss out on the thread pool, exception handling and
 * breakpoint/resume support that a framework provides.
 *
 * In fact, the task above can be split into two breadth-first traversals. A breadth-first
 * traversal with depth 1 is equivalent to crawling a URL list (the seed list).
 * We design two breadth-first crawlers, LinkCrawler and ProductCrawler:
 * 1) LinkCrawler traverses the product list pages, extracts the URL of each product
 *    detail page, and injects the extracted URLs into ProductCrawler.
 * 2) ProductCrawler uses the URLs injected by LinkCrawler as seeds, crawls them, and
 *    extracts each product detail page.
 *
 * @author hu
 */
public class DazhongDemo {

    public static class LinkCrawler extends BreadthCrawler {

        Injector injector;

        public LinkCrawler(String linkPath, String productPath) {
            setCrawlPath(linkPath);
            /* the injector injects seeds into the ProductCrawler crawler */
            injector = new Injector(productPath);
            /* LinkCrawler traverses the product list pages, i is the page number */
            for (int i = 1; i < 3; i++) {
                addSeed("http://t.dianping.com/list/hefei-category_1?pageno=" + i);
            }
            addRegex(".*");
        }

        @Override
        public void visit(Page page) {
            Document doc = page.getDoc();
            Elements links = doc.select("li[class^=floor]>a[track]");
            for (Element link : links) {
                /* href is the URL of a product detail page extracted from the product list page */
                String href = link.attr("abs:href");
                System.out.println(href);
                synchronized (injector) {
                    try {
                        /* inject the URL of the product detail page into ProductCrawler as a seed */
                        injector.inject(href, true);
                    } catch (IOException ex) {
                    }
                }
            }
        }

        /* with Config.topN = 0, a breadth-first traversal of depth 1 is equivalent to traversing the seed list */
        public void start() throws IOException {
            start(1);
        }
    }

    public static class ProductCrawler extends BreadthCrawler {

        public ProductCrawler(String productPath) {
            setCrawlPath(productPath);
            addRegex(".*");
            setResumable(true);
            setThreads(5);
        }

        @Override
        public void visit(Page page) {
            /* check whether the page is a product detail page; this check can be omitted in this program */
            if (!Pattern.matches("http://t.dianping.com/deal/[0-9]+", page.getUrl())) {
                return;
            }
            Document doc = page.getDoc();
            String name = doc.select("h1.title").first().text();
            String price = doc.select("span.price-display").first().text();
            String origin_price = doc.select("span.price-original").first().text();
            String validateDate = doc.select("div.validate-date").first().text();
            System.out.println(name + " " + price + "/" + origin_price + " " + validateDate);
        }

        /* with Config.topN = 0, a breadth-first traversal of depth 1 is equivalent to traversing the seed list */
        public void start() throws IOException {
            start(1);
        }
    }

    public static void main(String[] args) throws IOException {
        /* Config.topN limits the number of links the crawler follows from each page.
           This program only needs to traverse the seed URL list and does not need to
           continue crawling discovered links, so it is set to 0. */
        Config.topN = 0;

        /* Each crawler relies on a folder in which crawl information is stored and maintained.
           There are two crawlers here, so two crawl folders are needed. */
        String linkPath = "crawl_link";
        String productPath = "crawl_product";
        File productDir = new File(productPath);
        if (productDir.exists()) {
            FileUtils.deleteDir(productDir);
        }

        LinkCrawler linkCrawler = new LinkCrawler(linkPath, productPath);
        linkCrawler.start();

        ProductCrawler productCrawler = new ProductCrawler(productPath);
        productCrawler.start();
    }
}
Writing a product-information crawler in Java (crawling Dianping group-buy deals)