Scrapy for Distributed Crawlers (4): Rule-Based Crawling and Passing Parameters on the Command Line


The topic of this article is the implementation of rule-based crawling and the passing of custom parameters on the command line. In my view, a rule-based crawler is essentially an automatic crawler.

Logically, the crawler works like this: we give it a starting URL; after fetching that page it extracts all URL links on it; a rule (restricted by a regular expression) selects the links we actually want; those pages are then crawled and processed (data extraction or other actions); and the whole procedure repeats until it is stopped. The obvious pitfall here is duplicate crawling, and the Scrapy framework already handles it for us. In general, the common solution to duplicate filtering is to keep a table of visited addresses and, before crawling a URL, check whether it is already in the table; if so, it is filtered out directly. Another option is an off-the-shelf probabilistic structure, the Bloom filter, sketched below.
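As an illustration of the second approach, here is a minimal Bloom-filter-style de-duplication check in plain Python. The class SimpleBloomFilter and its parameters are purely illustrative and not part of Scrapy; in a real project you would normally use a well-tested library instead.

# A minimal sketch of URL de-duplication with a Bloom filter (illustrative only).
import hashlib


class SimpleBloomFilter:
    def __init__(self, size=2 ** 20, num_hashes=5):
        self.size = size                      # number of bits in the filter
        self.num_hashes = num_hashes          # bit positions derived per URL
        self.bits = bytearray(size // 8 + 1)  # bit array, packed into bytes

    def _positions(self, url):
        # Derive several bit positions from salted md5 digests of the URL.
        for i in range(self.num_hashes):
            digest = hashlib.md5(('%d:%s' % (i, url)).encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))


seen = SimpleBloomFilter()
for url in ['http://www.douban.com/group/explore?tag=travel',
            'http://www.douban.com/group/explore?tag=travel']:
    if url in seen:
        print('skip duplicate:', url)
    else:
        seen.add(url)
        print('crawl:', url)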

 

This article discusses how to use CrawlSpider to crawl the information of all groups under a Douban tag:

 

1. We create a new class that inherits from CrawlSpider.

 

 

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from douban.items import GroupInfo

class MySpider(CrawlSpider):

 

For more information about CrawlSpider, see: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
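As a point of reference, the documented pattern is to declare the rules as a class attribute. A minimal, illustrative CrawlSpider in that style might look like this (the spider name, domain, and pattern here are placeholders, not the ones used later in this article):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    # Follow links under /category/ and hand each resulting page to parse_item.
    rules = (
        Rule(LinkExtractor(allow=(r'/category/.+', )),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.logger.info('crawled %s', response.url)

The spider in this article assigns self.rules inside __init__ instead; that works too, as long as the base constructor runs afterwards to compile them.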

 

2. To accept parameters from the command line, the spider's constructor must take the desired parameter as an argument:
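The relevant part of the constructor, condensed from the full listing in section 4, looks like this; target is the value we will supply with -a on the command line:

class MySpider(CrawlSpider):
    name = 'douban.xp'
    current = ''

    def __init__(self, target=None):
        # 'target' arrives from the command line via -a target=<tag>
        if self.current != '':
            target = self.current
        if target is not None:
            self.current = target
        self.start_urls = [
            'http://www.douban.com/group/explore?tag=%s' % (target)
        ]
        # ... the rules are defined here as well; see the full code in section 4 ...
        super(MySpider, self).__init__()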

 

Use the following command line:

scrapy crawl douban.xp --logfile=test.log -a target=%E6%96%E5%87%B7

In this way, the custom parameter (here, the URL-encoded tag name) is passed in.

 

The last line is: super(MySpider, self).__init__()

Jump to the definition of CrawlSpider to see why: its constructor calls a private method that compiles the rules variable. If a custom spider defines its own __init__ and never calls the base constructor, that compilation step never runs and an error is raised.
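Paraphrased, the relevant part of CrawlSpider looks roughly like this (details vary by Scrapy version):

class CrawlSpider(Spider):
    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        # Compiles self.rules into a usable form. This never runs if a
        # subclass overrides __init__ without calling the base constructor.
        self._compile_rules()

So a subclass that defines its own __init__ must end it with super(MySpider, self).__init__(); otherwise the rules are never compiled and rule crawling fails.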

 

3. Write rules:

 

        self.rules = (
            Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$', ),
                               restrict_xpaths=('//span[@class="next"]')),
                 callback='parse_next_page', follow=True),
        )

 

allow defines the pattern of links to extract, matched with a regular expression; restrict_xpaths limits extraction to links found inside the specified tags; callback names the method called with each page fetched from an extracted link.
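A quick way to see what the allow pattern actually matches is to test it with the re module directly; the sample URLs below are illustrative:

import re

# The same pattern passed to LinkExtractor(allow=...) above.
pattern = re.compile(r'/group/explore[?]start=.*?[&]tag=.*?$')

# A "next page" link with a start offset matches:
print(bool(pattern.search('http://www.douban.com/group/explore?start=30&tag=travel')))  # True
# The plain tag page without start= does not:
print(bool(pattern.search('http://www.douban.com/group/explore?tag=travel')))           # False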

 

4. The full code:

 

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from douban.items import GroupInfo


class MySpider(CrawlSpider):
    name = 'douban.xp'
    current = ''
    allowed_domains = ['douban.com']

    def __init__(self, target=None):
        if self.current != '':
            target = self.current
        if target is not None:
            self.current = target
        self.start_urls = [
            'http://www.douban.com/group/explore?tag=%s' % (target)
        ]
        self.rules = (
            Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$', ),
                               restrict_xpaths=('//span[@class="next"]')),
                 callback='parse_next_page', follow=True),
        )
        # call the base class constructor so that the rules get compiled
        super(MySpider, self).__init__()

    def parse_next_page(self, response):
        self.logger.info(msg='begin init the page %s ' % response.url)
        list_item = response.xpath('//a[@class="nbg"]')
        # check that the group list is not empty
        if not list_item:
            self.logger.info(msg='cant select anything in selector ')
            return
        for a_item in list_item:
            item = GroupInfo()
            item['group_url'] = ''.join(a_item.xpath('@href').extract())
            item['group_tag'] = self.current
            item['group_name'] = ''.join(a_item.xpath('@title').extract())
            yield item

    def parse_start_url(self, response):
        self.logger.info(msg='begin init the start page %s ' % response.url)
        list_item = response.xpath('//a[@class="nbg"]')
        # check that the group list is not empty
        if not list_item:
            self.logger.info(msg='cant select anything in selector ')
            return
        for a_item in list_item:
            item = GroupInfo()
            item['group_url'] = ''.join(a_item.xpath('@href').extract())
            item['group_tag'] = self.current
            item['group_name'] = ''.join(a_item.xpath('@title').extract())
            yield item

    def parse_next_page_people(self, response):
        self.logger.info('Hi, this is the next page! %s', response.url)
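The GroupInfo item imported above is not shown in the article; a minimal items.py matching the field names used by the spider would look roughly like this (an assumption, not the original project file):

# douban/items.py -- sketch inferred from the fields the spider fills in.
import scrapy


class GroupInfo(scrapy.Item):
    group_url = scrapy.Field()
    group_tag = scrapy.Field()
    group_name = scrapy.Field()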

 

5. Actual operation:

 

scrapy crawl douban.xp --logfile=test.log -a target=%E6%96%E5%87%B7

 

Actual data results:

 

 

 

This article mainly solves two problems:

1. How to pass parameters from the command line

2. How to write a rule-based crawler

The demonstrated functionality is fairly limited; in real use you also need other rules, for example to avoid getting banned. The next article will give a brief introduction to that.

 
