(4) Scrapy for distributed crawlers: rule-based crawling and passing command-line arguments
The topic of this article is rule-based crawling and passing custom arguments on the command line. Rule-based spiders are, in my opinion, what crawlers really are.
Logically, the crawler works like this: we give it a starting URL; after entering that page, it extracts all the links on it, keeps the ones that match a rule we define (restricted by regular expressions), crawls those pages in turn, processes each one (data extraction or other actions), and repeats the whole cycle until it is stopped.

A potential problem here is duplicate crawling. The Scrapy framework already handles this for us, but in general there are two common solutions: maintain a table of visited addresses and check every URL against it before crawling, filtering out the ones already seen; or use an off-the-shelf probabilistic structure, the Bloom filter.
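The visited-address idea above can be sketched in a few lines. This is only an illustration of the concept, not Scrapy's actual dupefilter (Scrapy ships its own request-fingerprint filter); the class and URLs here are made up for the example:

```python
# A minimal sketch of crawl-frontier de-duplication with a "seen" set.
# Hashing the URL keeps the table compact; a Bloom filter would trade
# exactness for even less memory.
import hashlib


class SeenFilter:
    def __init__(self):
        self._seen = set()

    def is_new(self, url):
        """Return True the first time a URL is offered, False afterwards."""
        fingerprint = hashlib.sha1(url.encode('utf-8')).hexdigest()
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True


f = SeenFilter()
print(f.is_new('http://www.douban.com/group/explore?tag=x'))  # True
print(f.is_new('http://www.douban.com/group/explore?tag=x'))  # False
```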
This article shows how to use CrawlSpider to crawl the information of all groups under a Douban tag:
1. We create a new class that inherits from CrawlSpider.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from douban.items import GroupInfo

class MySpider(CrawlSpider):
For more information about CrawlSpider, see: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
2. To accept custom parameters from the command line, receive them in the spider's constructor:

def __init__(self, target=None):
    if self.current != '':
        target = self.current
    if target is not None:
        self.current = target
    self.start_urls = [
        'http://www.douban.com/group/explore?tag=%s' % target
    ]
Use the following command line:
scrapy crawl douban.xp --logfile=test.log -a target=%E6%96%E5%87%B7
In this way, the custom parameters can be passed in.
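The target value on the command line is a URL-encoded (percent-encoded) UTF-8 tag. A quick way to produce such a value is the standard library's urllib.parse; the tag '科技' below is just an illustrative example, not one from this article:

```python
# Percent-encode a tag so it can be passed via `-a target=...`.
from urllib.parse import quote, unquote

encoded = quote('科技')   # percent-encode the UTF-8 bytes of the tag
print(encoded)            # %E7%A7%91%E6%8A%80
print(unquote(encoded))   # 科技
```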
The last line: super(MySpider, self).__init__()
Jumping to the definition of CrawlSpider explains why this call matters: its constructor invokes a private method that compiles the rules attribute. If a custom spider sets rules but never calls the base constructor, an error is raised as soon as the rules are used.
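The effect can be illustrated with a simplified stand-in for CrawlSpider; this is not Scrapy's real code, just an analogue showing why skipping super().__init__() breaks things:

```python
# A simplified analogue of CrawlSpider (not Scrapy's actual implementation):
# the base constructor "compiles" self.rules into an internal attribute.
class BaseSpider:
    rules = ()

    def __init__(self):
        self._compile_rules()

    def _compile_rules(self):
        # Pretend compilation: freeze the rules into an internal list.
        self._compiled = list(self.rules)


class GoodSpider(BaseSpider):
    def __init__(self):
        self.rules = ('rule-1',)
        super(GoodSpider, self).__init__()  # rules get compiled


class BadSpider(BaseSpider):
    def __init__(self):
        self.rules = ('rule-1',)            # super() never called


print(GoodSpider()._compiled)               # ['rule-1']
print(hasattr(BadSpider(), '_compiled'))    # False: later use raises AttributeError
```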
3. Write rules:
self.rules = (
    Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$',),
                       restrict_xpaths=('//span[@class="next"]',)),
         callback='parse_next_page', follow=True),
)
allow defines the link pattern to extract, using regular-expression matching; restrict_xpaths limits extraction to links found inside the specified elements; callback names the function invoked for each page fetched from an extracted link.
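The allow pattern can be sanity-checked against sample paths with plain re before running the spider; the URLs below are made-up examples:

```python
import re

# The allow pattern from the rule above.
pattern = r'/group/explore[?]start=.*?[&]tag=.*?$'

# A paginated "next" link carries both start= and tag=, so it matches.
print(bool(re.search(pattern, '/group/explore?start=30&tag=test')))  # True
# The first page has no start= parameter, so it does not match;
# that page is handled by parse_start_url instead.
print(bool(re.search(pattern, '/group/explore?tag=test')))           # False
```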
4. The full code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from douban.items import GroupInfo


class MySpider(CrawlSpider):
    name = 'douban.xp'
    current = ''
    allowed_domains = ['douban.com']

    def __init__(self, target=None):
        if self.current != '':
            target = self.current
        if target is not None:
            self.current = target
        self.start_urls = [
            'http://www.douban.com/group/explore?tag=%s' % target
        ]
        self.rules = (
            Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$',),
                               restrict_xpaths=('//span[@class="next"]',)),
                 callback='parse_next_page', follow=True),
        )
        # Call the base-class constructor so the rules get compiled.
        super(MySpider, self).__init__()

    def parse_next_page(self, response):
        self.logger.info('begin to parse the page %s', response.url)
        list_item = response.xpath('//a[@class="nbg"]')
        # Check that the group list is not empty.
        if not list_item:
            self.logger.info('cannot select anything in selector')
            return
        for a_item in list_item:
            item = GroupInfo()
            item['group_url'] = ''.join(a_item.xpath('@href').extract())
            item['group_tag'] = self.current
            item['group_name'] = ''.join(a_item.xpath('@title').extract())
            yield item

    def parse_start_url(self, response):
        self.logger.info('begin to parse the start page %s', response.url)
        list_item = response.xpath('//a[@class="nbg"]')
        # Check that the group list is not empty.
        if not list_item:
            self.logger.info('cannot select anything in selector')
            return
        for a_item in list_item:
            item = GroupInfo()
            item['group_url'] = ''.join(a_item.xpath('@href').extract())
            item['group_tag'] = self.current
            item['group_name'] = ''.join(a_item.xpath('@title').extract())
            yield item

    def parse_next_page_people(self, response):
        self.logger.info('Hi, this is the next page! %s', response.url)
5. Actual operation:
scrapy crawl douban.xp --logfile=test.log -a target=%E6%96%E5%87%B7
Actual data results:
This article mainly solves two problems:
1. How to pass arguments from the command line
2. How to write a rule-based crawler
The demonstrated functionality is quite limited. In real use, you also need other safeguards, such as how to avoid getting banned. The next article will give a brief introduction to that.