(4) Scrapy crawlers: rule-based automatic crawling and passing parameters from the command line


The topic of this discussion is how to implement rule-based crawling and how to pass custom parameters from the command line; in my opinion, a rule-based crawler is the real crawler.

Let's first look at how this crawler works logically:

We are given a starting URL. After entering that page we extract all of its links, and we define a rule (constrained with a regular expression) to pick out the links we want. We then crawl those pages, process them (extract data or perform other actions), and repeat the loop until we stop. There is a potential problem here: crawling the same page repeatedly. The Scrapy framework already handles this, but in general the usual approach to duplicate filtering is to keep a table of visited addresses and, before crawling a URL, check whether it is already in the table; if it is, the URL is filtered out. The other option is a ready-made, general-purpose solution such as a Bloom filter.
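To illustrate the visited-address-table idea only (Scrapy ships its own request duplicate filter, so you normally do not write this yourself), here is a minimal sketch; the names seen_urls and should_crawl are made up for the example:

# A minimal sketch of duplicate filtering with a visited-address table.
# Scrapy has a built-in dupe filter; this only illustrates the idea.
seen_urls = set()              # the "address table"

def should_crawl(url):
    """Return True the first time a URL is seen, False afterwards."""
    if url in seen_urls:
        return False           # already crawled, filter it out
    seen_urls.add(url)
    return True

# Usage:
for url in ['http://example.com/a', 'http://example.com/a', 'http://example.com/b']:
    print(url, should_crawl(url))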

This discussion covers how to use CrawlSpider to crawl all of the group information under a Douban tag:

First, we create a new class that inherits from CrawlSpider

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from douban.items import GroupInfo

class MySpider(CrawlSpider):

For more on CrawlSpider, please refer to: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider

Second, to support passing parameters from the command line, we need to accept the parameters we want in the class constructor.
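The constructor, excerpted from the full listing in section four below (target is the parameter we will pass in with -a), looks like this:

class MySpider(CrawlSpider):
    name = 'douban.xp'
    current = ''
    allowed_domains = ['douban.com']

    def __init__(self, target=None):
        if self.current is not '':
            target = self.current
        if target is not None:
            self.current = target
        self.start_urls = [
            'http://www.douban.com/group/explore?tag=%s' % (target)
        ]
        self.rules = (
            Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$',),
                               restrict_xpaths=('//span[@class="next"]')),
                 callback='parse_next_page', follow=True),
        )
        # Call the parent base constructor
        super(MySpider, self).__init__()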

Use this at the command line:

scrapy crawl douban.xp --logfile=test.log -a target=%E6%96%87%E5%85%B7

This passes the custom parameter into the spider.

The last line deserves special mention: super(MySpider, self).__init__()

If we jump to the definition of CrawlSpider, we can see why:
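At the time of writing, CrawlSpider's constructor looks roughly like this (paraphrased from the Scrapy source; check your installed version for the exact code):

class CrawlSpider(Spider):
    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()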

The constructor calls a private method to compile the rules attribute. If our own spider's constructor never calls the base class constructor, the rules are not compiled and the spider fails with an error.

Third, write the rules:

self.rules = (
    Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$',),
                       restrict_xpaths=('//span[@class="next"]')),
         callback='parse_next_page', follow=True),
)

allow defines the link pattern we want to extract, using regular expression matching; restrict_xpaths limits link extraction to the region inside the specified tags; callback names the callback function that processes each page fetched from the extracted links.
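If you want to sanity-check the extractor before wiring it into the spider, one option (an illustrative sketch, not part of the original post) is to try it inside scrapy shell, where response is the fetched page:

# Inside: scrapy shell 'http://www.douban.com/group/explore?tag=%E6%96%87%E5%85%B7'
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$',),
                   restrict_xpaths=('//span[@class="next"]'))
for link in le.extract_links(response):
    print(link.url)   # the "next page" links the rule would follow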

Fourth, the full code for reference:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from douban.items import GroupInfo


class MySpider(CrawlSpider):
    name = 'douban.xp'
    current = ''
    allowed_domains = ['douban.com']

    def __init__(self, target=None):
        if self.current is not '':
            target = self.current
        if target is not None:
            self.current = target
        self.start_urls = [
            'http://www.douban.com/group/explore?tag=%s' % (target)
        ]
        self.rules = (
            Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$',),
                               restrict_xpaths=('//span[@class="next"]')),
                 callback='parse_next_page', follow=True),
        )
        # Call the parent base constructor
        super(MySpider, self).__init__()

    def parse_next_page(self, response):
        self.logger.info(msg='begin init the page %s' % response.url)
        list_item = response.xpath('//a[@class="nbg"]')
        # Check the group is not null
        if list_item is None:
            self.logger.info(msg='cant select anything in selector')
            return
        for a_item in list_item:
            item = GroupInfo()
            item['group_url'] = ''.join(a_item.xpath('@href').extract())
            item['group_tag'] = self.current
            item['group_name'] = ''.join(a_item.xpath('@title').extract())
            yield item

    def parse_start_url(self, response):
        self.logger.info(msg='begin init the start page %s' % response.url)
        list_item = response.xpath('//a[@class="nbg"]')
        # Check the group is not null
        if list_item is None:
            self.logger.info(msg='cant select anything in selector')
            return
        for a_item in list_item:
            item = GroupInfo()
            item['group_url'] = ''.join(a_item.xpath('@href').extract())
            item['group_tag'] = self.current
            item['group_name'] = ''.join(a_item.xpath('@title').extract())
            yield item

    def parse_next_page_people(self, response):
        self.logger.info('Hi, this is a next page! %s', response.url)
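The GroupInfo item imported from douban.items is not shown in the original post; based on the fields used above, it presumably looks something like this (an assumption, since the original definition is not given):

# douban/items.py -- assumed definition, inferred from the fields used above
import scrapy

class GroupInfo(scrapy.Item):
    group_url = scrapy.Field()
    group_tag = scrapy.Field()
    group_name = scrapy.Field()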

Fifth, run it:

scrapy crawl douban.xp --logfile=test.log -a target=%E6%96%87%E5%85%B7

Actual scraped results:

This post mainly solves two problems:

1. How to pass parameters from the command line

2. How to write a CrawlSpider

The functionality demonstrated here is fairly limited; in practice you will need to write further rules, for example to avoid getting banned. The next article will give a brief introduction to that.
