Recently, while reading the Scrapy 0.24 official documentation and getting annoyed, I stumbled upon the Chinese translation of the 0.24 docs, which is a nice bonus: http://scrapy-chs.readthedocs.org/zh_CN/0.24/
Combined with the examples from the official documentation, here is a quick write-up:
```python
import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = (
        'http://example.com/page1',
        'http://example.com/page2',
    )

    def parse(self, response):
        # collect `item_urls`
        for item_url in item_urls:
            yield scrapy.Request(item_url, self.parse_item)

    def parse_item(self, response):
        item = MyItem()
        # populate `item` fields
        # and extract item_details_url
        yield scrapy.Request(item_details_url, self.parse_details,
                             meta={'item': item})

    def parse_details(self, response):
        item = response.meta['item']
        # populate more `item` fields
        return item
```
The crawler inherits from Spider, with a unique name, name = 'myspider'. The crawler's default entry addresses are given in start_urls = (); either a tuple or a list is fine.
Looking at the Spider source code:
```python
# code snippet
class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from
    this class.
    """
    name = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []
```
When the spider is initialized, it checks whether name is None and whether start_urls exists. The code is straightforward.
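To see those checks in action without installing Scrapy, here is a simplified, Scrapy-free re-implementation of the logic in Spider.__init__ (MiniSpider and Named are illustrative names, not Scrapy classes):

```python
# Simplified stand-in for Spider.__init__'s name / start_urls checks.
class MiniSpider(object):
    name = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []


class Named(MiniSpider):
    name = 'named'


spider = Named()
print(spider.start_urls)  # missing start_urls falls back to an empty list

try:
    MiniSpider()          # no name anywhere -> ValueError
except ValueError as err:
    print(err)
```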
Continuing down:
```python
# code snippet
def start_requests(self):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)

def parse(self, response):
    raise NotImplementedError
```
It is easy to see that the start_requests method iterates over the URLs in start_urls and issues a Request for each one.
parse is the default entry point for handling responses; it raises NotImplementedError, so it must be implemented, i.e. overridden in the inheriting class.
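The iteration in start_requests can be illustrated with a stripped-down stand-in that yields one request-like dict per start URL (real Scrapy yields Request objects built by make_requests_from_url; the dicts here are just placeholders):

```python
# Stand-in for start_requests: walk start_urls, yield one "request" each.
def start_requests(start_urls):
    for url in start_urls:
        yield {'url': url, 'callback': 'parse'}  # placeholder for a Request


requests = list(start_requests(('http://example.com/page1',
                                'http://example.com/page2')))
print(len(requests))  # one request per start URL
```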
Back to the sample code:
```python
# the first function
def parse(self, response):
    # collect `item_urls`
    # can be understood as: the set of hyperlinks in all of the
    # site's navigation menus
    for item_url in item_urls:
        yield scrapy.Request(item_url, self.parse_item)
This is the default entry point, inherited from the parent Spider class (or rather, an interface that must be implemented), so it needs to be overridden.
In the function body, a collection of URLs named `item_urls` is extracted from the response returned by start_requests (a GET request by default).
The collection is then iterated and each URL is requested in turn.
Now look at the Request source code:
```python
# partial code
class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None,
                 body=None, cookies=None, meta=None, encoding='utf-8',
                 priority=0, dont_filter=False, errback=None):
        self._encoding = encoding  # this one has to be set first
        self.method = str(method).upper()
        self._set_url(url)
        self._set_body(body)
        assert isinstance(priority, int), \
            "Request priority not an integer: %r" % priority
        self.priority = priority
        assert callback or not errback, \
            "Cannot use errback without a callback"
        self.callback = callback
        self.errback = errback
        self.cookies = cookies or {}
        self.headers = Headers(headers or {}, encoding=encoding)
        self.dont_filter = dont_filter
        self._meta = dict(meta) if meta else None

    @property
    def meta(self):
        if self._meta is None:
            self._meta = {}
        return self._meta
```
The more commonly used parameters:

- url: the URL to be requested, i.e. the next page to process.
- callback: specifies which function will handle the response returned by this request.
- method: usually does not need to be specified; the default is GET.
- headers: the headers included with the request; usually not required. Anyone who has written a crawler with urllib2 will recognize the contents, which generally look like this:

```
Host: media.readthedocs.org
User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0
Accept: text/css,*/*;q=0.1
Accept-Language: zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Referer: http://scrapy-chs.readthedocs.org/zh_CN/0.24/
Cookie: _ga=GA1.2.1612165614.1415584110;
Connection: keep-alive
If-Modified-Since: Mon, 25 Aug 2014 21:59:35 GMT
Cache-Control: max-age=0
```

- meta: commonly used to pass data between different requests; a dict. For example:

```python
request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'},
                               meta={'dont_merge_cookies': True})
```

- encoding: the default 'utf-8' is fine.
- dont_filter: indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
- errback: specifies an error-handling function.
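To make dont_filter concrete, here is a hedged sketch of what a scheduler-style duplicates filter does and how dont_filter bypasses it. This is not Scrapy's actual implementation (Scrapy fingerprints whole requests rather than comparing raw URLs):

```python
# Toy duplicates filter: remember scheduled URLs, skip repeats unless
# dont_filter=True explicitly bypasses the check.
seen = set()

def should_schedule(url, dont_filter=False):
    if dont_filter:
        return True       # bypass the duplicates filter entirely
    if url in seen:
        return False      # an identical request was already scheduled
    seen.add(url)
    return True

print(should_schedule('http://example.com/page1'))                    # True
print(should_schedule('http://example.com/page1'))                    # False
print(should_schedule('http://example.com/page1', dont_filter=True))  # True
```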
No surprises; next up is the Response source code:
```python
# partial code
class Response(object_ref):

    def __init__(self, url, status=200, headers=None, body='', flags=None,
                 request=None):
        self.headers = Headers(headers or {})
        self.status = int(status)
        self._set_body(body)
        self._set_url(url)
        self.request = request
        self.flags = [] if flags is None else list(flags)

    @property
    def meta(self):
        try:
            return self.request.meta
        except AttributeError:
            raise AttributeError("Response.meta not available, this response "
                                 "is not tied to any request")
```
The parameters are similar to Request's.
A Response object represents an HTTP response, which is usually downloaded (by the Downloader) and fed to the spiders for processing. You can inspect one interactively:

```
scrapy shell http://xxxx.xxx.xx
>>> dir(response)
```
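The meta property in the Response source simply forwards to the request's meta. The minimal stand-ins below (MiniRequest and MiniResponse are made-up names, not Scrapy classes) mirror that forwarding, including the error when a response has no request attached:

```python
# Stand-ins showing that response.meta is really the meta of the
# request that produced the response.
class MiniRequest:
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = dict(meta) if meta else {}


class MiniResponse:
    def __init__(self, url, request=None):
        self.url = url
        self.request = request

    @property
    def meta(self):
        try:
            return self.request.meta
        except AttributeError:
            raise AttributeError("Response.meta not available, this "
                                 "response is not tied to any request")


request = MiniRequest('http://example.com', meta={'item': {'id': 1}})
response = MiniResponse('http://example.com', request=request)
print(response.meta['item'])

detached = MiniResponse('http://example.com')  # no request attached
try:
    detached.meta
except AttributeError as err:
    print(err)
```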
Continuing down:
```python
# the second function
def parse_item(self, response):
    item = MyItem()
    # populate `item` fields
    # corresponds to the list page under the navigation bar; there may
    # also be pagination to handle here
    # and extract item_details_url
    yield scrapy.Request(item_details_url, self.parse_details,
                         meta={'item': item})
```
This receives the response for each URL that the first function collected and requested. On the current page it gathers the initial information for each detail entity, along with the URL of its detail page.
At this point, you need to continue requesting downward, fetching each entity's detail page.
The item constructed in this function can also carry information that is already available (for example, the category implied by the navigation label), passed to the next callback via the Request's meta attribute.
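The hand-off between callbacks can be simulated in plain Python: the partially filled item rides along in the request's meta dict and is picked back up from the response in the next callback (all names below are illustrative stand-ins, not Scrapy API):

```python
# Simulation of the parse_item -> parse_details hand-off via meta.
def parse_item(detail_url):
    item = {'category': 'blog', 'title': 'from the list page'}
    # stand-in for: yield scrapy.Request(detail_url, parse_details,
    #                                    meta={'item': item})
    return {'url': detail_url, 'meta': {'item': item}}


def parse_details(response):
    item = response['meta']['item']       # pick the item back up
    item['body'] = 'from the detail page'
    return item


request = parse_item('http://example.com/item/1')
fake_response = {'url': request['url'], 'meta': request['meta']}
item = parse_details(fake_response)
print(sorted(item))
```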
Continuing down:
```python
# the third function
def parse_details(self, response):
    item = response.meta['item']
    # populate more `item` fields
    return item
```
At this point, the request has reached the entity's specific page, i.e. the detail page (for example, clicking through to an article from a blog's article list).
Here you need to receive the information passed in from the previous function.
```python
def parse_details(self, response):
    item = response.meta['item']
    # a default value can also be supplied:
    # item = response.meta.get('item', None)
    # which returns None when the 'item' key is absent from meta
```
Then, on this page, use XPath, CSS, re, and so on to extract the detailed fields; as for the specifics of selectors, more on that later~~~ I originally meant to keep this short, and it turned into all this...
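As a small taste of what that extraction step looks like, here is a regex-based example using only the standard re module (Scrapy's own Selector offers .xpath(), .css(), and .re() for this; the HTML snippet is made up):

```python
import re

# Toy HTML fragment standing in for a downloaded detail page.
html = ('<h1 class="title">Hello Scrapy</h1>'
        '<span class="date">2014-11-10</span>')

# Pull out two fields with capture groups.
title = re.search(r'<h1[^>]*>(.*?)</h1>', html).group(1)
date = re.search(r'<span class="date">(.*?)</span>', html).group(1)
print(title, date)
```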
Finally, the completed item is returned. The data can then be picked up in ITEM_PIPELINES for the next stage of processing~~~
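A hedged sketch of where those returned items end up: a pipeline class whose process_item hook matches the interface Scrapy pipelines implement (the class itself is illustrative, not from any real project):

```python
# Minimal pipeline: collect every item it sees and pass it along.
class CollectPipeline:
    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        return item  # returning the item passes it to the next pipeline


pipeline = CollectPipeline()
pipeline.process_item({'title': 'Hello'}, spider=None)
print(len(pipeline.items))
```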
Python crawler----(Scrapy framework Improved (1), custom request crawl)