The topic of this post is rule-based crawling and passing custom parameters on the command line. In my opinion, a rule-driven spider is the real crawler.
Let's first look at how such a crawler works logically:
We are given a starting URL. After fetching that page we extract all of its links, and a rule (defined with a regular expression) restricts them to the kind of links we want. We then crawl those pages, do a step of processing on each (extract data or perform other actions), and loop the whole operation until we stop. A potential problem here is crawling the same page repeatedly. Scrapy already handles this inside the framework, but in general there are two common approaches to duplicate filtering: keep a table of visited addresses and check it before each crawl, filtering out anything already crawled; or use a ready-made general-purpose solution such as a Bloom filter.
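To make the address-table idea concrete, here is a minimal sketch (not from the original post; the names seen_urls and should_crawl are illustrative). Scrapy's scheduler already performs an equivalent check through request fingerprinting, so in practice you rarely write this yourself:

# Minimal "address table" duplicate filter: record every URL before crawling it
# and skip anything already recorded. A Bloom filter would replace the set when
# the table becomes too large to keep exactly in memory.
seen_urls = set()

def should_crawl(url):
    if url in seen_urls:      # already crawled -> filter it out
        return False
    seen_urls.add(url)        # remember the address
    return True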
This post shows how to use CrawlSpider to crawl all of the group information under a Douban tag:
1. Create a new class that inherits from CrawlSpider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from douban.items import GroupInfo

class MySpider(CrawlSpider):
For more on CrawlSpider, see: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
2. To support command-line parameter passing, accept the parameters we want in the class's constructor:
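Here is the constructor, excerpted from the full listing in section 4. The -a options given to scrapy crawl are delivered to the spider as keyword arguments, which is how target arrives here:

    def __init__(self, target=None):
        if self.current != '':
            target = self.current
        if target is not None:
            self.current = target
        self.start_urls = [
            'http://www.douban.com/group/explore?tag=%s' % target
        ]
        self.rules = (
            Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$', ),
                               restrict_xpaths='//span[@class="next"]'),
                 callback='parse_next_page', follow=True),
        )
        # call the parent constructor so the rules get compiled
        super(MySpider, self).__init__()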
Use this at the command line:
scrapy crawl douban.xp --logfile=test.log -a target=%e6%96%87%e5%85%b7
This passes the custom parameter through to the spider.
The last line deserves special mention: super(MySpider, self).__init__()
If we jump to the source and look at the definition of CrawlSpider, we find the following.
Its constructor calls a private method (_compile_rules) to compile the rules variable. If our own spider's constructor never calls the base constructor, the rules are never compiled and the spider fails with an error.
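For reference, the relevant part of CrawlSpider looks roughly like this (paraphrased from the Scrapy source; details vary between versions):

class CrawlSpider(Spider):

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()   # turns self.rules into usable, compiled Rule objects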
3. Write the rules:
self.rules = (
    Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$', ),
                       restrict_xpaths='//span[@class="next"]'),
         callback='parse_next_page', follow=True),
)
allow defines the pattern of links we want to extract, using regular-expression matching; restrict_xpaths strictly limits link extraction to the region inside the specified tag; callback is the callback function invoked on each page extracted by the rule.
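For clarity, here is the same rule again with each argument annotated (the values are identical to the listing below; only the comments are added):

Rule(
    LinkExtractor(
        allow=('/group/explore[?]start=.*?[&]tag=.*?$', ),  # regex the candidate link must match
        restrict_xpaths='//span[@class="next"]',            # only extract links found inside the "next" pagination span
    ),
    callback='parse_next_page',  # method called with each downloaded page matched by the rule
    follow=True,                 # keep applying the rules to pages reached this way
)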
4. Full code for reference:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from douban.items import GroupInfo


class MySpider(CrawlSpider):
    name = 'douban.xp'
    current = ''
    allowed_domains = ['douban.com']

    def __init__(self, target=None):
        if self.current != '':
            target = self.current
        if target is not None:
            self.current = target
        self.start_urls = [
            'http://www.douban.com/group/explore?tag=%s' % target
        ]
        self.rules = (
            Rule(LinkExtractor(allow=('/group/explore[?]start=.*?[&]tag=.*?$', ),
                               restrict_xpaths='//span[@class="next"]'),
                 callback='parse_next_page', follow=True),
        )
        # call the parent base constructor (compiles the rules)
        super(MySpider, self).__init__()

    def parse_next_page(self, response):
        self.logger.info('begin init the page %s' % response.url)
        list_item = response.xpath('//a[@class="nbg"]')
        # check that the group list is not empty
        if not list_item:
            self.logger.info('cant select anything in selector')
            return
        for a_item in list_item:
            item = GroupInfo()
            item['group_url'] = ''.join(a_item.xpath('@href').extract())
            item['group_tag'] = self.current
            item['group_name'] = ''.join(a_item.xpath('@title').extract())
            yield item

    def parse_start_url(self, response):
        self.logger.info('begin init the start page %s' % response.url)
        list_item = response.xpath('//a[@class="nbg"]')
        # check that the group list is not empty
        if not list_item:
            self.logger.info('cant select anything in selector')
            return
        for a_item in list_item:
            item = GroupInfo()
            item['group_url'] = ''.join(a_item.xpath('@href').extract())
            item['group_tag'] = self.current
            item['group_name'] = ''.join(a_item.xpath('@title').extract())
            yield item

    def parse_next_page_people(self, response):
        self.logger.info('Hi, this is a next page! %s', response.url)
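The GroupInfo item imported from douban.items is not shown in the post. A minimal definition consistent with the fields used above might look like this (an assumption, not the author's actual file):

import scrapy

class GroupInfo(scrapy.Item):
    # fields inferred from the spider code above
    group_url = scrapy.Field()
    group_tag = scrapy.Field()
    group_name = scrapy.Field()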
5. Actual run:
scrapy crawl douban.xp --logfile=test.log -a target=%e6%96%87%e5%85%b7
The resulting data:
This post mainly solved two problems:
1. How to pass parameters in from the command line
2. How to write a CrawlSpider
The functionality demonstrated here is fairly limited; in practice you will need to write further rules, for example to avoid getting banned. The next article will give a brief introduction to that.