1. What is a crawler?
Logically, a crawler corresponds to a tree: the branches are web pages, and the leaves are the information of interest.
When we look for information starting from a URL, the content returned by that URL may contain the information we are interested in, or it may contain other URLs that in turn may lead to the information we want. A crawler corresponds to this search for information, and the search process builds up a tree.
[Figure: spider.png (http://s3.51cto.com/wyfs02/M00/58/6F/wKioL1Sv6OmBHQnEAACEFi2pjKg209.jpg)]
The scrapy.Spider class provides an interface that lets us design the entire information search process.
Parameters needed at runtime can be passed to the spider, for example the query parameters that follow the ? in a URL. Such information can optionally be passed with the -a option of the scrapy crawl command.
[Figure: param.png (http://s3.51cto.com/wyfs02/M01/58/6F/wKioL1Sv6xajm8xKAANRGRppZX4042.jpg)]
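As a hedged sketch of how this works (the spider name, parameter name, and URL below are made-up placeholders, not part of the original note): every -a key=value argument given on the command line is delivered to the spider's __init__ as a keyword argument.

import scrapy

# Run with:  scrapy crawl param_example -a category=books
class ParamExampleSpider(scrapy.Spider):
    name = 'param_example'

    def __init__(self, category=None, **kwargs):
        super(ParamExampleSpider, self).__init__(**kwargs)
        # each -a key=value arrives here as a keyword argument
        self.start_urls = ['http://www.example.com/categories/%s' % category]

    def parse(self, response):
        self.log('crawled %s' % response.url)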
2. Spider cycle
For a spider, the crawling loop looks roughly like this:
Initialize Requests with the initial URLs and set a callback function. When a Request finishes downloading, a Response is generated and passed to the callback function as a parameter.
The initial requests in the spider are obtained by calling start_requests(), which reads the URLs in start_urls and generates a Request for each of them with parse as the callback function.
Inside the callback function, parse the returned (web page) content and return Item objects, Request objects, or an iterable container of both. Returned Request objects are then handled by Scrapy, which downloads their content and invokes the callback function that was set (it can be the same function).
Within the callback function you can use selectors (or BeautifulSoup, lxml, or any parser you prefer) to parse the web page content and generate items from the extracted data.
Finally, the items returned by the spider are typically stored in a database (handled by an Item Pipeline) or written to a file using Feed exports.
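As an illustration of that last step, here is a minimal Item Pipeline sketch (the class name and output file are assumptions for the example) that appends each item to a JSON-lines file; it would be enabled through the ITEM_PIPELINES setting in settings.py.

import json

class JsonWriterPipeline(object):
    """Illustrative pipeline: write every item the spider returns as one JSON line."""

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

Feed exports can achieve something similar without any code, for example: scrapy crawl somespider -o items.json.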
Scrapy ships with several default spider implementations; the ones highlighted below are the default Spider and CrawlSpider.
3. The simplest spider (the default Spider)
Construct Request objects from the URLs in the instance attribute start_urls
The framework is responsible for executing the requests
Pass the Response returned for each Request to the parse method for analysis
The simplified source code:
class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit
    from this class.
    """

    name = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True)

    def parse(self, response):
        raise NotImplementedError


BaseSpider = create_deprecated_class('BaseSpider', Spider)
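For comparison, a minimal sketch (the spider name and URLs are placeholders) that overrides start_requests() directly instead of relying on start_urls, following the same loop described above:

import scrapy

class StartRequestsSpider(scrapy.Spider):
    name = 'start_requests_example'

    def start_requests(self):
        # build the initial Requests ourselves and choose their callback
        urls = [
            'http://www.example.com/page/1',
            'http://www.example.com/page/2',
        ]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.log('got %s' % response.url)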
An example of a callback function that returns multiple Requests:
import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        sel = scrapy.Selector(response)
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
Only two parameters are required to construct a Request object: a URL and a callback function.
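Request also takes optional arguments, for example meta, a dict that travels with the request and is available on the response inside the callback. A small sketch (the spider, the XPath, and the parse_detail callback are illustrative assumptions; response.urljoin assumes Scrapy 1.0 or later):

import scrapy

class DetailSpider(scrapy.Spider):
    name = 'detail_example'
    start_urls = ['http://www.example.com/list.html']

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(
                response.urljoin(href),            # resolve relative links (Scrapy >= 1.0)
                callback=self.parse_detail,
                meta={'from_page': response.url},  # extra data for the callback
            )

    def parse_detail(self, response):
        self.log('reached via %s' % response.meta['from_page'])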
4. CrawlSpider
Usually we need to decide in the spider which links on a page should be followed and which pages are endpoints whose links need not be followed. CrawlSpider gives us a useful abstraction, Rule, that makes this kind of crawl simple: you only have to tell Scrapy, in the rules, which links need to be followed.
Recall the spider we used to crawl the Mininova website:
class MininovaSpider(CrawlSpider):
    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/yesterday']
    rules = [Rule(LinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = response.xpath("//h1/text()").extract()
        torrent['description'] = response.xpath("//div[@id='description']").extract()
        torrent['size'] = response.xpath("//div[@id='specifications']/p[2]/text()[2]").extract()
        return torrent
The rule in the code above means: responses whose URLs match /tor/\d+ are handed to parse_torrent, and the links on those responses are not followed further (follow defaults to False once a callback is given).
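If you wanted both to parse the matched pages and to keep following the links found on them, the rule could say so explicitly with follow=True (a variant of the MininovaSpider rule above, not part of the original example):

rules = [
    # parse matched pages AND keep following links found on them
    Rule(LinkExtractor(allow=['/tor/\d+']), callback='parse_torrent', follow=True),
]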
An example is also available in the official documentation:
rules = (
    # Extract links matching 'category.php' (but not 'subsection.php') and follow them
    # (no callback means follow defaults to True)
    Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

    # Extract links matching 'item.php' and parse them with the spider's parse_item method
    Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
)
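The parse_item callback referenced by the second rule is then an ordinary spider method; a minimal sketch (the XPath expressions and field names are illustrative, not taken from the documentation):

def parse_item(self, response):
    self.log('Hi, this is an item page! %s' % response.url)
    item = {}  # plain dicts are accepted as items in Scrapy 1.0+
    item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
    item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
    return item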
In addition to Spider and CrawlSpider, there are also XMLFeedSpider, CSVFeedSpider, and SitemapSpider.
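As a small taste of one of them, a minimal SitemapSpider sketch (the sitemap URL is a placeholder; the import path is the one used in Scrapy 1.0+):

from scrapy.spiders import SitemapSpider

class MySitemapSpider(SitemapSpider):
    name = 'sitemap_example'
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        # by default every URL found in the sitemap is passed to parse()
        self.log('parsed %s' % response.url)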
Python crawler framework Scrapy learning notes 8 ---- Spider