Python Crawler Framework Scrapy Learning Notes 8 ---- Spider


What is a crawler?

From a logical point of view, a crawler corresponds to a tree: the branches are web pages, and the leaves are the pieces of information we are interested in.

When we look for information of interest starting from a URL, the content returned by the current URL may itself contain the information we want, or it may contain other URLs that in turn may lead to it. A crawl is therefore a search for information, and the search process builds up a tree.

[Figure: spider.png, a crawl visualized as a tree]

The scrapy.Spider class provides an interface that lets us design this entire information-search process.


    1. Passing runtime parameters to the spider, such as the query parameters that follow the ? in a URL. These can be passed with the -a option of the crawl command, as shown in the sketch below the figure.

      [Figure: param.png, passing arguments to a spider with scrapy crawl -a]
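A minimal sketch of how such an argument reaches the spider; the spider name, argument name, and URL here are illustrative and not from the original note:

import scrapy

class TagSpider(scrapy.Spider):
    # Hypothetical spider that accepts a runtime argument.
    # Run it with:  scrapy crawl tagspider -a tag=python
    name = 'tagspider'

    def __init__(self, tag=None, *args, **kwargs):
        super(TagSpider, self).__init__(*args, **kwargs)
        # Arguments passed with -a arrive as keyword arguments; here the value
        # becomes the query string after the ? in the start URL.
        self.start_urls = ['http://www.example.com/search?tag=%s' % tag]

    def parse(self, response):
        self.log('Crawled %s' % response.url)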


2. The spider crawl cycle

For a spider, the crawl loop works roughly as follows:

  1. Initialize Requests from the initial URLs and set a callback function. When a request has been downloaded and returned, a Response is generated and passed as an argument to that callback function.

    The initial Requests in the spider are obtained by calling start_requests(), which reads the URLs in start_urls and generates a Request for each one, with parse as the callback function.

  2. Within the callback function, parse the returned (web page) content and return Item objects, Request objects, or an iterable containing both. Any returned Request objects are then processed by Scrapy: their content is downloaded and the assigned callback function (which may be the same one) is invoked.

  3. Within the callback function, you can use selectors (or BeautifulSoup, lxml, or any parser you prefer) to parse the web page content and generate items from the parsed data.

  4. Finally, the items returned by the spider are typically stored in a database (handled by some Item Pipeline) or written to a file using Feed exports.
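As a rough illustration of step 4, here is a minimal item pipeline sketch; the class name and output file are assumptions for illustration, and the pipeline would still have to be enabled in the ITEM_PIPELINES setting:

import json

class JsonWriterPipeline(object):
    # Hypothetical pipeline that appends every item to a JSON-lines file.

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Called once for every item the spider yields.
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

Alternatively, Feed exports can write the items to a file without any custom code, for example with scrapy crawl somespider -o items.json (the spider name here is a placeholder).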


Scrapy provides default implementations for several kinds of spiders; the ones highlighted below are the default Spider and CrawlSpider.


3. The simplest spider (the default Spider)

It constructs Request objects from the URLs in the instance attribute start_urls.

The framework is responsible for executing those requests.

The Response object returned for each request is passed to the parse method for analysis.
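A minimal spider that relies entirely on this default behavior might look like the following sketch; the spider name, URL, and selector are illustrative and not from the original note:

import scrapy

class QuotesSpider(scrapy.Spider):
    # Hypothetical spider: name and start URL are placeholders.
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # Each Response for a URL in start_urls lands here automatically.
        for text in response.xpath('//span[@class="text"]/text()').extract():
            yield {'quote': text}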


The simplified source code of scrapy.Spider:

class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from this
    class.
    """

    name = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True)

    def parse(self, response):
        raise NotImplementedError


BaseSpider = create_deprecated_class('BaseSpider', Spider)


An example where the callback function returns multiple Requests (as well as items):

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        sel = scrapy.Selector(response)
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

Only two arguments are needed to construct a Request object: the URL and the callback function.


4. CrawlSpider

Usually we need to decide in the spider which links on a page should be followed and which pages are endpoints whose links need not be followed. CrawlSpider gives us a useful abstraction, Rule, that makes this kind of crawl simple: you only have to tell Scrapy, via the rules, which links to follow.

Recall the spider we used to crawl the Mininova website:

from scrapy.spiders import CrawlSpider, Rule          # imports added; paths assume Scrapy 1.x
from scrapy.linkextractors import LinkExtractor
from myproject.items import TorrentItem               # the project's Item class (import path assumed)

class MininovaSpider(CrawlSpider):
    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/yesterday']
    rules = [Rule(LinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = response.xpath("//h1/text()").extract()
        torrent['description'] = response.xpath("//div[@id='description']").extract()
        torrent['size'] = response.xpath("//div[@id='specifications']/p[2]/text()[2]").extract()
        return torrent

The rule in the code above means: responses whose URLs match /tor/\d+ are handed to parse_torrent, and the links on those responses are not followed any further (when a callback is given, follow defaults to False).

An example is also available in the official documentation:

rules = (
    # Extract links matching 'category.php' (but not 'subsection.php')
    # and follow them (no callback means follow defaults to True).
    Rule(LinkExtractor(allow=('category\.php',), deny=('subsection\.php',))),

    # Extract links matching 'item.php' and parse them with the spider's parse_item method.
    Rule(LinkExtractor(allow=('item\.php',)), callback='parse_item'),
)


In addition to Spider and CrawlSpider, there are also XMLFeedSpider, CSVFeedSpider, and SitemapSpider.
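As a quick taste of one of these, here is a minimal SitemapSpider sketch; the sitemap URL, rule pattern, and callback name are assumptions for illustration:

from scrapy.spiders import SitemapSpider   # import path assumes Scrapy 1.x

class ProductSitemapSpider(SitemapSpider):
    name = 'product_sitemap'
    # Sitemaps (or robots.txt files) whose listed URLs should be crawled.
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    # URLs containing '/product/' go to parse_product; other URLs are ignored.
    sitemap_rules = [('/product/', 'parse_product')]

    def parse_product(self, response):
        yield {'url': response.url,
               'title': response.xpath('//h1/text()').extract()}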
