Python Crawler ---- (Scrapy Framework Advanced (1): Custom Request Crawling)


Recently, while reading the Scrapy 0.24 official documentation, I was getting a bit frustrated, and then I stumbled upon the Chinese translation of the 0.24 docs, which is a nice bonus ~ http://scrapy-chs.readthedocs.org/zh_CN/0.24/

Combining the examples from the official documentation, here is a brief walkthrough:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = (
        'http://example.com/page1',
        'http://example.com/page2',
    )

    def parse(self, response):
        # collect `item_urls`
        for item_url in item_urls:
            yield scrapy.Request(item_url, self.parse_item)

    def parse_item(self, response):
        item = MyItem()
        # populate `item` fields
        # and extract item_details_url
        yield scrapy.Request(item_details_url, self.parse_details,
                             meta={'item': item})

    def parse_details(self, response):
        item = response.meta['item']
        # populate more `item` fields
        return item

The spider inherits from Spider, has a unique name name = 'myspider', and a default set of entry URLs, start_urls = (); either a tuple or a list is fine.
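The MyItem class imported in the example is not shown. A minimal sketch of what myproject/items.py might contain, with hypothetical field names that the later snippets reuse, could be:

# myproject/items.py -- a hypothetical item definition for this example;
# the field names are assumptions for illustration only
import scrapy

class MyItem(scrapy.Item):
    title = scrapy.Field()
    category = scrapy.Field()
    detail_url = scrapy.Field()
    date = scrapy.Field()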

From the Spider source code, you can see:

# Code snippet
class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from this
    class.
    """
    name = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

When the spider is initialized, it checks that name is not None and that start_urls exists. The code is straightforward.
Continue looking down:

# Code snippet
def start_requests(self):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)

def parse(self, response):
    raise NotImplementedError

It is easy to see that the start_requests method iterates over the URLs in start_urls and issues a Request for each one.

parse is the default entry point for handling responses; it must be implemented, which means the subclass has to override the parse method.
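Since this article is about customizing requests, it is worth noting that start_requests itself can be overridden when the default GET on start_urls is not enough. A minimal sketch, where the URLs, the header value and the callback name are placeholders:

# A sketch of overriding start_requests to customize the initial requests;
# the URLs, header value and callback name are placeholders
import scrapy

class CustomStartSpider(scrapy.Spider):
    name = 'customstart'

    def start_requests(self):
        urls = ('http://example.com/page1', 'http://example.com/page2')
        for url in urls:
            # attach custom headers and route the response to parse_page
            # instead of the default parse
            yield scrapy.Request(url,
                                 callback=self.parse_page,
                                 headers={'User-Agent': 'Mozilla/5.0'})

    def parse_page(self, response):
        self.log('got %s with status %d' % (response.url, response.status))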

Look again at the sample code:

# The first function
def parse(self, response):
    # collect `item_urls`
    # can be understood as: the set of hyperlinks for all the site's navigation menus
    for item_url in item_urls:
        yield scrapy.Request(item_url, self.parse_item)

This is the default entry point, inherited from the parent Spider class (or rather, an interface that must be implemented), so it has to be implemented here.

In this function body, a collection of URLs, item_urls, is obtained from the Response returned by start_requests (which issues GET requests by default).

The collection is then traversed and each URL is requested, as sketched below.
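The example leaves the extraction of item_urls unspecified. A minimal sketch using response.xpath, where the XPath expression is an assumption that depends entirely on the target site's markup:

# A sketch of the first function; the XPath expression is an assumption
def parse(self, response):
    # collect `item_urls` from the navigation menu links
    item_urls = response.xpath('//ul[@class="nav"]//a/@href').extract()
    for item_url in item_urls:
        # (relative links would first need to be joined against response.url)
        yield scrapy.Request(item_url, self.parse_item)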

Then look at the Request source code:

# Partial code
class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None):
        self._encoding = encoding  # this one has to be set first
        self.method = str(method).upper()
        self._set_url(url)
        self._set_body(body)
        assert isinstance(priority, int), "Request priority not an integer: %r" % priority
        self.priority = priority
        assert callback or not errback, "Cannot use errback without a callback"
        self.callback = callback
        self.errback = errback
        self.cookies = cookies or {}
        self.headers = Headers(headers or {}, encoding=encoding)
        self.dont_filter = dont_filter
        self._meta = dict(meta) if meta else None

    @property
    def meta(self):
        if self._meta is None:
            self._meta = {}
        return self._meta

The more commonly used parameters are:

url: the URL that needs to be requested, i.e. the next page to process.

callback: specifies which function will handle the response returned by this request.

method: usually does not need to be specified; the default GET method is used.

headers: the header fields sent with the request; generally not required. Anyone who has written a crawler with urllib2 will recognize the contents, which typically look like this:

    Host: media.readthedocs.org
    User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0
    Accept: text/css,*/*;q=0.1
    Accept-Language: zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3
    Accept-Encoding: gzip, deflate
    Referer: http://scrapy-chs.readthedocs.org/zh_CN/0.24/
    Cookie: _ga=GA1.2.1612165614.1415584110;
    Connection: keep-alive
    If-Modified-Since: Mon, 25 Aug 2014 21:59:35 GMT
    Cache-Control: max-age=0

meta: commonly used to pass data between different requests; it is a dict. For example:

    request_with_cookies = Request(url="http://www.example.com",
                                   cookies={'currency': 'USD', 'country': 'UY'},
                                   meta={'dont_merge_cookies': True})

encoding: the default 'utf-8' is fine.

dont_filter: indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.

errback: specifies the error handling function.
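Putting several of these parameters together, a sketch of a fully specified request inside a spider callback might look like this; the URL, header value and the callback/errback names are placeholders:

# A sketch combining the common parameters above; URL, headers and the
# callback/errback names are placeholders
def parse(self, response):
    yield scrapy.Request('http://example.com/list?page=2',
                         callback=self.parse_item,
                         headers={'Referer': 'http://example.com/'},
                         meta={'page': 2},
                         dont_filter=True,           # skip the scheduler's duplicates filter
                         errback=self.handle_error)  # called when the download fails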

As expected, next comes the Response source code:

# Partial code
class Response(object_ref):

    def __init__(self, url, status=200, headers=None, body='', flags=None, request=None):
        self.headers = Headers(headers or {})
        self.status = int(status)
        self._set_body(body)
        self._set_url(url)
        self.request = request
        self.flags = [] if flags is None else list(flags)

    @property
    def meta(self):
        try:
            return self.request.meta
        except AttributeError:
            raise AttributeError("Response.meta not available, this response "
                                 "is not tied to any request")

The parameters are similar to the above.

A Response object represents an HTTP response, which is usually downloaded (by the Downloader) and fed to the Spiders for processing. You can inspect one interactively with the Scrapy shell:

    scrapy shell http://xxxx.xxx.xx
    >>> dir(response)

to view the available information.
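For reference, a few of the attributes typically inspected in that shell session (shown as a sketch; the values depend on the page fetched):

    >>> response.url         # the URL that was fetched
    >>> response.status      # HTTP status code, e.g. 200
    >>> response.headers     # the response headers
    >>> response.body[:200]  # the first bytes of the raw body
    >>> response.xpath('//title/text()').extract()   # quick selector check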

Continue looking down:

# The second function
def parse_item(self, response):
    item = MyItem()
    # populate `item` fields
    # corresponds to the list page under the navigation bar; there may also be pagination here
    # and extract item_details_url
    yield scrapy.Request(item_details_url, self.parse_details, meta={'item': item})

This receives the responses for each of the URLs that the first function collected and requested. Within the current page it finds the initial information for every detail entity, as well as the URL of the corresponding detail page.

At this point, you need to keep going one level deeper and request the detail entity's page.

An Item is created and used in this function, although it does not have to be. Information that is already known here (for example, the category implied by the navigation label) is passed on to the next callback through the request's meta attribute; a sketch follows below.
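A minimal sketch of what parse_item might look like in practice; the selectors, and the use of the category field, are assumptions for illustration:

# A sketch of the second function; selectors and field usage are assumptions
def parse_item(self, response):
    # the category could come from the navigation label of this list page
    category = response.xpath('//h2[@class="nav-title"]/text()').extract()[0]
    for link in response.xpath('//div[@class="list"]//a/@href').extract():
        item = MyItem()
        item['category'] = category
        # hand the partially filled item over to parse_details
        yield scrapy.Request(link, self.parse_details, meta={'item': item})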

Continue looking down:

# The third function
def parse_details(self, response):
    item = response.meta['item']
    # populate more `item` fields
    return item

At this point the request has reached the entity's own page, i.e. the detail page (for example, clicking through from a blog's article list into an article).

Here you need to receive the information passed in from the previous function.

def parse_details(self, response):
    item = response.meta['item']
    # a default value can also be supplied:
    # item = response.meta.get('item', None)
    # which returns None when the 'item' key does not exist in the meta dict

Then, on this page, use XPath, CSS, re and so on to select the detailed fields (see the sketch below); as for the specific selectors, I will cover them later ~~~ I originally wanted to keep this short, and then it turned into all of this ...
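Just to make the idea concrete, a hedged sketch of extracting fields with the three mechanisms mentioned; all selectors and field names are placeholders that depend on the actual page:

# A sketch of field extraction in the third function; selectors and field
# names are placeholders
def parse_details(self, response):
    item = response.meta['item']
    item['detail_url'] = response.url
    # XPath selector
    item['title'] = response.xpath('//h1/text()').extract()[0]
    # CSS selector
    item['date'] = response.css('span.date::text').extract()[0]
    # a regular expression can also be applied through a selector, e.g.
    # response.xpath('//span[@class="date"]/text()').re(r'\d{4}-\d{2}-\d{2}')
    return item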

Finally, return the resulting item. The data then reaches the ITEM_PIPELINES, where the next step of processing takes place ~~~
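As a rough illustration of that next step, a minimal pipeline sketch; the class name and the JSON-lines storage are placeholders, just one possible way to handle the returned items:

# myproject/pipelines.py -- a placeholder pipeline sketch
import json

class MyPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # write each item as one JSON line
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

It would still need to be enabled in settings.py, for example with ITEM_PIPELINES = {'myproject.pipelines.MyPipeline': 300}.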




