(4) Scrapy for distributed crawlers: rule-based crawling and passing command-line arguments
The topic of this article is rule-based crawling and passing custom arguments on the command line. Rule-based spiders are, in my opinion, what crawlers really are.
Logically, the crawler works like this: we give it a starting URL; after entering that page, it extracts all the links on it, keeps the ones that match a rule we define (restricted by regular expressions), crawls those pages in turn, processes each one (data extraction or other actions), and repeats the whole cycle until it is stopped.

A potential problem here is duplicate crawling. The Scrapy framework already handles this for us, but in general there are two common solutions: maintain a table of visited addresses and check every URL against it before crawling, filtering out the ones already seen; or use an off-the-shelf probabilistic structure, the Bloom filter.
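The visited-address idea above can be sketched in a few lines. This is only an illustration of the concept, not Scrapy's actual dupefilter (Scrapy ships its own request-fingerprint filter); the class and URLs here are made up for the example:

```python
# A minimal sketch of crawl-frontier de-duplication with a "seen" set.
# Hashing the URL keeps the table compact; a Bloom filter would trade
# exactness for even less memory.
import hashlib


class SeenFilter:
    def __init__(self):
        self._seen = set()

    def is_new(self, url):
        """Return True the first time a URL is offered, False afterwards."""
        fingerprint = hashlib.sha1(url.encode('utf-8')).hexdigest()
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True


f = SeenFilter()
print(f.is_new('http://www.douban.com/group/explore?tag=x'))  # True
print(f.is_new('http://www.douban.com/group/explore?tag=x'))  # False
```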
This article shows how to use CrawlSpider to crawl the information of all groups under a Douban tag:
1. We create a new class that inherits from CrawlSpider.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from douban.items import GroupInfo

class MySpider(CrawlSpider):
For more information about CrawlSpider, see: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
2. To accept custom parameters from the command line, receive them in the spider's constructor:

def __init__(self, target=None):
    if self.current != '':
        target = self.current
    if target is not None:
        self.current = target
    self.start_urls = [
        'http://www.douban.com/group/explore?tag=%s' % target
    ]
Use the following command line:
scrapy crawl douban.xp --logfile=test.log -a target=%E6%96%E5%87%B7
In this way, the custom parameters can be passed in.
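The target value on the command line is a URL-encoded (percent-encoded) UTF-8 tag. A quick way to produce such a value is the standard library's urllib.parse; the tag '科技' below is just an illustrative example, not one from this article:

```python
# Percent-encode a tag so it can be passed via `-a target=...`.
from urllib.parse import quote, unquote

encoded = quote('科技')   # percent-encode the UTF-8 bytes of the tag
print(encoded)            # %E7%A7%91%E6%8A%80
print(unquote(encoded))   # 科技
```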
The last line: super(MySpider, self).__init__()
Jumping to the definition of CrawlSpider explains why this call matters: its constructor invokes a private method that compiles the rules attribute. If a custom spider sets rules but never calls the base constructor, an error is raised as soon as the rules are used.
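The effect can be illustrated with a simplified stand-in for CrawlSpider; this is not Scrapy's real code, just an analogue showing why skipping super().__init__() breaks things:

```python
# A simplified analogue of CrawlSpider (not Scrapy's actual implementation):
# the base constructor "compiles" self.rules into an internal attribute.
class BaseSpider:
    rules = ()

    def __init__(self):
        self._compile_rules()

    def _compile_rules(self):
        # Pretend compilation: freeze the rules into an internal list.
        self._compiled = list(self.rules)


class GoodSpider(BaseSpider):
    def __init__(self):
        self.rules = ('rule-1',)
        super(GoodSpider, self).__init__()  # rules get compiled


class BadSpider(BaseSpider):
    def __init__(self):
        self.rules = ('rule-1',)            # super() never called


print(GoodSpider()._compiled)               # ['rule-1']
print(hasattr(BadSpider(), '_compiled'))    # False: later use raises AttributeError
```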
3. Write rules:
self.rules = (
    Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$',),
                       restrict_xpaths=('//span[@class="next"]',)),
         callback='parse_next_page', follow=True),
)
allow defines the link pattern to extract, using regular-expression matching; restrict_xpaths limits extraction to links found inside the specified elements; callback names the function invoked for each page fetched from an extracted link.
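The allow pattern can be sanity-checked against sample paths with plain re before running the spider; the URLs below are made-up examples:

```python
import re

# The allow pattern from the rule above.
pattern = r'/group/explore[?]start=.*?[&]tag=.*?$'

# A paginated "next" link carries both start= and tag=, so it matches.
print(bool(re.search(pattern, '/group/explore?start=30&tag=test')))  # True
# The first page has no start= parameter, so it does not match;
# that page is handled by parse_start_url instead.
print(bool(re.search(pattern, '/group/explore?tag=test')))           # False
```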
4. The full code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from douban.items import GroupInfo


class MySpider(CrawlSpider):
    name = 'douban.xp'
    current = ''
    allowed_domains = ['douban.com']

    def __init__(self, target=None):
        if self.current != '':
            target = self.current
        if target is not None:
            self.current = target
        self.start_urls = [
            'http://www.douban.com/group/explore?tag=%s' % target
        ]
        self.rules = (
            Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$',),
                               restrict_xpaths=('//span[@class="next"]',)),
                 callback='parse_next_page', follow=True),
        )
        # Call the base-class constructor so the rules get compiled.
        super(MySpider, self).__init__()

    def parse_next_page(self, response):
        self.logger.info('begin to parse the page %s', response.url)
        list_item = response.xpath('//a[@class="nbg"]')
        # Check that the group list is not empty.
        if not list_item:
            self.logger.info('cannot select anything in selector')
            return
        for a_item in list_item:
            item = GroupInfo()
            item['group_url'] = ''.join(a_item.xpath('@href').extract())
            item['group_tag'] = self.current
            item['group_name'] = ''.join(a_item.xpath('@title').extract())
            yield item

    def parse_start_url(self, response):
        self.logger.info('begin to parse the start page %s', response.url)
        list_item = response.xpath('//a[@class="nbg"]')
        # Check that the group list is not empty.
        if not list_item:
            self.logger.info('cannot select anything in selector')
            return
        for a_item in list_item:
            item = GroupInfo()
            item['group_url'] = ''.join(a_item.xpath('@href').extract())
            item['group_tag'] = self.current
            item['group_name'] = ''.join(a_item.xpath('@title').extract())
            yield item

    def parse_next_page_people(self, response):
        self.logger.info('Hi, this is the next page! %s', response.url)
5. Actual operation:
scrapy crawl douban.xp --logfile=test.log -a target=%E6%96%E5%87%B7
Actual data results:
This article mainly solves two problems:
1. How to pass arguments from the command line
2. How to write a rule-based crawler
The demonstrated functionality is quite limited. In real use, you also need other safeguards, such as how to avoid getting banned. The next article will give a brief introduction to that.