Python Crawler ---- (Scrapy Framework Advanced (1): Custom Request Crawling)


Recently, while reading the Scrapy 0.24 official documentation, I was getting a bit frustrated, and then I stumbled upon the Chinese translation of the 0.24 docs, which is a nice bonus ~ http://scrapy-chs.readthedocs.org/zh_CN/0.24/

Combining the examples from the official documentation, here is a brief walkthrough:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = (
        'http://example.com/page1',
        'http://example.com/page2',
    )

    def parse(self, response):
        # collect `item_urls`
        for item_url in item_urls:
            yield scrapy.Request(item_url, self.parse_item)

    def parse_item(self, response):
        item = MyItem()
        # populate `item` fields
        # and extract item_details_url
        yield scrapy.Request(item_details_url, self.parse_details,
                             meta={'item': item})

    def parse_details(self, response):
        item = response.meta['item']
        # populate more `item` fields
        return item

The spider inherits from Spider, has a unique name name = 'myspider', and a default set of entry URLs, start_urls = (); either a tuple or a list is fine.
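The MyItem class imported in the example is not shown. A minimal sketch of what myproject/items.py might contain, with hypothetical field names that the later snippets reuse, could be:

# myproject/items.py -- a hypothetical item definition for this example;
# the field names are assumptions for illustration only
import scrapy

class MyItem(scrapy.Item):
    title = scrapy.Field()
    category = scrapy.Field()
    detail_url = scrapy.Field()
    date = scrapy.Field()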

From the Spider source code, you can see:

# Code snippet
class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from this
    class.
    """
    name = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

When the spider is initialized, it checks that name is not None and that start_urls exists. The code is straightforward.
Continue looking down:

# Code snippet
def start_requests(self):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)

def parse(self, response):
    raise NotImplementedError

It is easy to see that the start_requests method iterates over the URLs in start_urls and issues a Request for each one.

parse is the default entry point for handling responses; it must be implemented, which means the subclass has to override the parse method.
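Since this article is about customizing requests, it is worth noting that start_requests itself can be overridden when the default GET on start_urls is not enough. A minimal sketch, where the URLs, the header value and the callback name are placeholders:

# A sketch of overriding start_requests to customize the initial requests;
# the URLs, header value and callback name are placeholders
import scrapy

class CustomStartSpider(scrapy.Spider):
    name = 'customstart'

    def start_requests(self):
        urls = ('http://example.com/page1', 'http://example.com/page2')
        for url in urls:
            # attach custom headers and route the response to parse_page
            # instead of the default parse
            yield scrapy.Request(url,
                                 callback=self.parse_page,
                                 headers={'User-Agent': 'Mozilla/5.0'})

    def parse_page(self, response):
        self.log('got %s with status %d' % (response.url, response.status))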

Look again at the sample code:

# The first function
def parse(self, response):
    # collect `item_urls`
    # can be understood as: the set of hyperlinks for all the site's navigation menus
    for item_url in item_urls:
        yield scrapy.Request(item_url, self.parse_item)

This is the default entry point, inherited from the parent Spider class (or rather, an interface that must be implemented), so it has to be implemented here.

In this function body, a collection of URLs, item_urls, is obtained from the Response returned by start_requests (which issues GET requests by default).

The collection is then traversed and each URL is requested, as sketched below.
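The example leaves the extraction of item_urls unspecified. A minimal sketch using response.xpath, where the XPath expression is an assumption that depends entirely on the target site's markup:

# A sketch of the first function; the XPath expression is an assumption
def parse(self, response):
    # collect `item_urls` from the navigation menu links
    item_urls = response.xpath('//ul[@class="nav"]//a/@href').extract()
    for item_url in item_urls:
        # (relative links would first need to be joined against response.url)
        yield scrapy.Request(item_url, self.parse_item)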

Then look at the Request source code:

# Partial code
class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None):
        self._encoding = encoding  # this one has to be set first
        self.method = str(method).upper()
        self._set_url(url)
        self._set_body(body)
        assert isinstance(priority, int), "Request priority not an integer: %r" % priority
        self.priority = priority
        assert callback or not errback, "Cannot use errback without a callback"
        self.callback = callback
        self.errback = errback
        self.cookies = cookies or {}
        self.headers = Headers(headers or {}, encoding=encoding)
        self.dont_filter = dont_filter
        self._meta = dict(meta) if meta else None

    @property
    def meta(self):
        if self._meta is None:
            self._meta = {}
        return self._meta

The more commonly used parameters are:

url: the URL that needs to be requested, i.e. the next page to process.

callback: specifies which function will handle the response returned by this request.

method: usually does not need to be specified; the default GET method is used.

headers: the header fields sent with the request; generally not required. Anyone who has written a crawler with urllib2 will recognize the contents, which typically look like this:

    Host: media.readthedocs.org
    User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0
    Accept: text/css,*/*;q=0.1
    Accept-Language: zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3
    Accept-Encoding: gzip, deflate
    Referer: http://scrapy-chs.readthedocs.org/zh_CN/0.24/
    Cookie: _ga=GA1.2.1612165614.1415584110;
    Connection: keep-alive
    If-Modified-Since: Mon, 25 Aug 2014 21:59:35 GMT
    Cache-Control: max-age=0

meta: commonly used to pass data between different requests; it is a dict. For example:

    request_with_cookies = Request(url="http://www.example.com",
                                   cookies={'currency': 'USD', 'country': 'UY'},
                                   meta={'dont_merge_cookies': True})

encoding: the default 'utf-8' is fine.

dont_filter: indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.

errback: specifies the error handling function.
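Putting several of these parameters together, a sketch of a fully specified request inside a spider callback might look like this; the URL, header value and the callback/errback names are placeholders:

# A sketch combining the common parameters above; URL, headers and the
# callback/errback names are placeholders
def parse(self, response):
    yield scrapy.Request('http://example.com/list?page=2',
                         callback=self.parse_item,
                         headers={'Referer': 'http://example.com/'},
                         meta={'page': 2},
                         dont_filter=True,           # skip the scheduler's duplicates filter
                         errback=self.handle_error)  # called when the download fails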

As expected, next comes the Response source code:

# Partial code
class Response(object_ref):

    def __init__(self, url, status=200, headers=None, body='', flags=None, request=None):
        self.headers = Headers(headers or {})
        self.status = int(status)
        self._set_body(body)
        self._set_url(url)
        self.request = request
        self.flags = [] if flags is None else list(flags)

    @property
    def meta(self):
        try:
            return self.request.meta
        except AttributeError:
            raise AttributeError("Response.meta not available, this response "
                                 "is not tied to any request")

The parameters are similar to the above.

A Response object represents an HTTP response, which is usually downloaded (by the Downloader) and fed to the Spiders for processing. You can inspect one interactively with the Scrapy shell:

    scrapy shell http://xxxx.xxx.xx
    >>> dir(response)

to view the available information.
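For reference, a few of the attributes typically inspected in that shell session (shown as a sketch; the values depend on the page fetched):

    >>> response.url         # the URL that was fetched
    >>> response.status      # HTTP status code, e.g. 200
    >>> response.headers     # the response headers
    >>> response.body[:200]  # the first bytes of the raw body
    >>> response.xpath('//title/text()').extract()   # quick selector check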

Continue looking down:

# The second function
def parse_item(self, response):
    item = MyItem()
    # populate `item` fields
    # corresponds to the list page under the navigation bar; there may also be pagination here
    # and extract item_details_url
    yield scrapy.Request(item_details_url, self.parse_details, meta={'item': item})

This receives the responses for each of the URLs that the first function collected and requested. Within the current page it finds the initial information for every detail entity, as well as the URL of the corresponding detail page.

At this point, you need to keep going one level deeper and request the detail entity's page.

An Item is created and used in this function, although it does not have to be. Information that is already known here (for example, the category implied by the navigation label) is passed on to the next callback through the request's meta attribute; a sketch follows below.
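A minimal sketch of what parse_item might look like in practice; the selectors, and the use of the category field, are assumptions for illustration:

# A sketch of the second function; selectors and field usage are assumptions
def parse_item(self, response):
    # the category could come from the navigation label of this list page
    category = response.xpath('//h2[@class="nav-title"]/text()').extract()[0]
    for link in response.xpath('//div[@class="list"]//a/@href').extract():
        item = MyItem()
        item['category'] = category
        # hand the partially filled item over to parse_details
        yield scrapy.Request(link, self.parse_details, meta={'item': item})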

Continue looking down:

# The third function
def parse_details(self, response):
    item = response.meta['item']
    # populate more `item` fields
    return item

At this point the request has reached the entity's own page, i.e. the detail page (for example, clicking through from a blog's article list into an article).

Here you need to receive the information passed in from the previous function.

def parse_details(self, response):
    item = response.meta['item']
    # a default value can also be supplied:
    # item = response.meta.get('item', None)
    # which returns None when the 'item' key does not exist in the meta dict

Then, on this page, use XPath, CSS, re and so on to select the detailed fields (see the sketch below); as for the specific selectors, I will cover them later ~~~ I originally wanted to keep this short, and then it turned into all of this ...
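Just to make the idea concrete, a hedged sketch of extracting fields with the three mechanisms mentioned; all selectors and field names are placeholders that depend on the actual page:

# A sketch of field extraction in the third function; selectors and field
# names are placeholders
def parse_details(self, response):
    item = response.meta['item']
    item['detail_url'] = response.url
    # XPath selector
    item['title'] = response.xpath('//h1/text()').extract()[0]
    # CSS selector
    item['date'] = response.css('span.date::text').extract()[0]
    # a regular expression can also be applied through a selector, e.g.
    # response.xpath('//span[@class="date"]/text()').re(r'\d{4}-\d{2}-\d{2}')
    return item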

Finally, return the resulting item. The data then reaches the ITEM_PIPELINES, where the next step of processing takes place ~~~
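As a rough illustration of that next step, a minimal pipeline sketch; the class name and the JSON-lines storage are placeholders, just one possible way to handle the returned items:

# myproject/pipelines.py -- a placeholder pipeline sketch
import json

class MyPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # write each item as one JSON line
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

It would still need to be enabled in settings.py, for example with ITEM_PIPELINES = {'myproject.pipelines.MyPipeline': 300}.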




