Scrapy Crawler Beginner Tutorial Four: Spider (crawler)

Scrapy Crawler Beginner Tutorial One: Installation and Basic Use
Scrapy Crawler Beginner Tutorial Two: Official Demo
Scrapy Crawler Beginner Tutorial Three: Command-Line Tool Introduction and Examples
Scrapy Crawler Beginner Tutorial Four: Spider (crawler)
Scrapy Crawler Beginner Tutorial Five: Selectors
Scrapy Crawler Beginner Tutorial Six: Items
Scrapy Crawler Beginner Tutorial Seven: Item Loaders
Scrapy Crawler Beginner Tutorial Eight: Interactive Shell for Easy Debugging
Scrapy Crawler Beginner Tutorial Nine: Item Pipeline
Scrapy Crawler Beginner Tutorial Ten: Feed Exports
Scrapy Crawler Beginner Tutorial Eleven: Requests and Responses
Scrapy Crawler Beginner Tutorial Twelve: Link Extractors

[TOC]

Development environment:
Python 3.6.0 (the latest version at the time of writing)
Scrapy 1.3.2 (the latest version at the time of writing)

Spider

A spider is a class that defines how a certain site (or a group of sites) will be scraped, including how to perform the crawl (that is, follow links) and how to extract structured data from its pages (that is, scrape items). In other words, the spider is where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).

For spiders, the scraping cycle goes through something like this:

    1. You start by generating the initial requests to crawl the first URLs, and you specify a callback function to be called with the responses downloaded from those requests.

      The first requests to perform are obtained (by default) by calling start_requests(), which generates a Request for each URL listed in start_urls, with the parse method as the callback function for those requests.

    2. In the callback function, you parse the response (web page) and return either dicts with the extracted data, Item objects, Request objects, or an iterable of these objects. Those requests will also contain a callback (maybe the same one), will then be downloaded by Scrapy, and their responses handled by the specified callback.

    3. In callback functions, you typically parse the page contents using selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and build items with the parsed data.

    4. Finally, the items returned from the spider are typically persisted to a database (in some item pipeline) or written to a file using feed exports.

Even though this cycle applies (more or less) to any kind of spider, there are different kinds of default spiders bundled into Scrapy for different purposes. We will cover those types here.

class scrapy.spiders.Spider

This is the simplest spider, and the one from which every other spider must inherit (including the spiders bundled with Scrapy, as well as spiders you write yourself). It does not provide any special functionality. It simply provides a default start_requests() implementation that sends requests from the start_urls spider attribute and calls the spider's parse method for each of the resulting responses.

name
A string which defines the name of this spider. The spider name is how the spider is located (and instantiated) by Scrapy, so it must be unique. However, nothing prevents you from instantiating more than one instance of the same spider. This is the most important spider attribute and it is required.

If the spider scrapes a single domain, a common practice is to name the spider after the domain. So, for example, a spider that crawls mywebsite.com would often be called mywebsite.

Attention
In Python 2, this must be ASCII.

allowed_domains
An optional list of strings containing the domains that this spider is allowed to crawl. Requests for URLs not belonging to the domains in this list will not be followed.

start_urls
A list of URLs where the spider will begin to crawl from, when no particular URLs are specified.

custom_settings
A dictionary of settings that will override the project-wide configuration when running this spider. It must be defined as a class attribute, since the settings are updated before instantiation.

For a list of the available built-in settings, see: Built-in Settings reference.
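
To make the attributes above concrete, here is a minimal sketch of how they are typically declared together on a spider class. It is only an illustration: the spider name, the quotes.toscrape.com domain and the DOWNLOAD_DELAY value are assumptions for the example, not part of the original text.

import scrapy

class QuotesSpider(scrapy.Spider):
    # Unique name used by "scrapy crawl quotes" to locate this spider
    name = 'quotes'
    # Requests to domains outside this list will not be followed
    allowed_domains = ['quotes.toscrape.com']
    # The default start_requests() issues one request per URL listed here
    start_urls = ['http://quotes.toscrape.com/']
    # Per-spider overrides of the project-wide settings
    custom_settings = {
        'DOWNLOAD_DELAY': 1.0,  # illustrative value
    }

    def parse(self, response):
        # Default callback, called with the response of each start URL
        self.logger.info('Visited %s', response.url)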

crawler
This attribute is set by the from_crawler() class method after the class is initialized, and links this spider instance to the Crawler object it is bound to.

The Crawler object encapsulates many components of the project for single-entry access (such as extensions, middlewares, signal managers, etc.). See the Crawler API for details.

settings
The configuration for running this spider. This is a Settings instance; see the Settings topic for a detailed introduction on the subject.

logger
A Python logger created with the spider's name. You can use it to send log messages, as described in Logging from Spiders.

from_crawler(crawler, *args, **kwargs)
This is the class method used by Scrapy to create your spiders.

You probably won't need to override this directly, because the default implementation acts as a proxy to the __init__() method, calling it with the given arguments args and named arguments kwargs.

Nonetheless, this method sets the crawler and settings attributes in the new instance, so they can be accessed later inside the spider's code.

    • Parameters:

      • crawler (Crawler instance) - the crawler to which the spider will be bound

      • args (list) - arguments passed to the __init__() method

      • kwargs (dict) - keyword arguments passed to the __init__() method
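
As a purely illustrative sketch (the MYEXT_ENABLED setting name and the extra enabled constructor argument are assumptions, not taken from the original text), overriding from_crawler() usually means calling the parent implementation and then using the crawler or its settings to configure the new instance:

import scrapy

class ConfiguredSpider(scrapy.Spider):
    name = 'configured'

    def __init__(self, enabled=False, *args, **kwargs):
        super(ConfiguredSpider, self).__init__(*args, **kwargs)
        self.enabled = enabled

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Read a (hypothetical) project setting and forward it to __init__()
        enabled = crawler.settings.getbool('MYEXT_ENABLED', False)
        spider = super(ConfiguredSpider, cls).from_crawler(
            crawler, *args, enabled=enabled, **kwargs)
        return spider

The parent from_crawler() also takes care of setting the crawler and settings attributes mentioned above, which is why calling it (rather than instantiating the class directly) is the usual pattern.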

start_requests()
This method must return an iterable with the first requests to crawl for this spider.

If you override start_requests(), there is no need to define start_urls; it will be ignored.

The default implementation generates requests from the URLs in start_urls, but start_requests() can be overridden.
For example, if you need to start by logging in using a POST request, you could do:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return requests for
        # each of them, with another callback
        pass

make_requests_from_url(url)
A method that receives a URL and returns a Request object (or a list of Request objects) to scrape. This method is used to construct the initial requests in the start_requests() method, and is typically used to convert URLs to requests.

Unless overridden, this method returns Requests with the parse() method as their callback function, and with the dont_filter parameter enabled (see the Request class for more information).
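
As a hedged sketch of overriding this hook (the custom X-Example header is an invented illustration, not something the original text prescribes), you could change how every initial request is built like this:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://www.example.com/']

    def make_requests_from_url(self, url):
        # Same contract as the default implementation: take a URL, return a Request.
        # Here we also attach an illustrative header to every initial request.
        return scrapy.Request(url, callback=self.parse, dont_filter=True,
                              headers={'X-Example': 'demo'})

    def parse(self, response):
        pass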

parse(response)
This is the default callback used by Scrapy to process downloaded responses, when their requests don't specify a callback.

The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. Other request callbacks have the same requirements as this one.

This method, as well as any other request callback, must return an iterable of Request objects and/or dicts or Item objects.

    • Parameters:

      • response (Response) - the response to parse

log(message[, level, component])
A wrapper that sends a log message through the spider's logger, kept for backward compatibility. For more information, see Logging from Spiders.

closed(reason)
Called when the spider closes. This method provides a shortcut to signals.connect() for the spider_closed signal.
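
A minimal sketch of these two hooks working together (the pages_seen counter is an assumption added purely for illustration): the spider logs through self.logger while running and reports a summary when closed() is called:

import scrapy

class CountingSpider(scrapy.Spider):
    name = 'counting'
    start_urls = ['http://www.example.com/']

    def __init__(self, *args, **kwargs):
        super(CountingSpider, self).__init__(*args, **kwargs)
        self.pages_seen = 0  # illustrative counter

    def parse(self, response):
        self.pages_seen += 1
        self.logger.info('Parsed %s', response.url)

    def closed(self, reason):
        # Called automatically when the spider finishes (shortcut for spider_closed)
        self.logger.info('Spider closed (%s) after %d pages', reason, self.pages_seen)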

Let's look at an example:

import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)

To return multiple requests and items from a single callback:

import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

Instead of start_urls, you can use start_requests() directly; and to give the data more structure, you can use Items:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

Spider arguments

Spiders can receive arguments that modify their behaviour. Some common uses of spider arguments are to define the start URLs or to restrict the crawl to certain sections of the site, but they can be used to configure any functionality of the spider.

Spider arguments are passed through the crawl command using the -a option. For example:

scrapy crawl myspider -a category=electronics

Spiders can access the arguments in their __init__ method:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        # ...

The default __init__ method will take any spider arguments and copy them to the spider as attributes. The above example can also be written as follows:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/categories/%s' % self.category)

Keep in mind that spider arguments are only strings. The spider will not do any parsing on its own. If you were to set the start_urls attribute from the command line, you would have to parse it yourself into a list, using something like ast.literal_eval or json.loads, and then set it as an attribute. Otherwise, you would end up iterating over a start_urls string (a very common Python pitfall), with each character being treated as a separate URL.
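
For instance, a minimal sketch of that parsing step (passing the list as JSON and the argument name start_urls are just one reasonable convention for this example, not mandated by the original text):

scrapy crawl myspider -a start_urls='["http://www.example.com/1.html", "http://www.example.com/2.html"]'

import json
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, start_urls=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        if start_urls:
            # The -a value arrives as a plain string; decode it into a real list
            self.start_urls = json.loads(start_urls)

    def parse(self, response):
        pass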

A valid use case is to set the HTTP auth credentials used by HttpAuthMiddleware or the user agent used by UserAgentMiddleware:
scrapy crawl myspider -a http_user=myuser -a http_pass=mypassword -a user_agent=mybot

Spider arguments can also be passed through the Scrapyd schedule.json API. See the Scrapyd documentation.

Generic spiders

Scrapy comes with some useful generic spiders that you can subclass your spiders from. Their aim is to provide convenient functionality for a few common scraping cases, like following all links on a site based on certain rules, crawling from sitemaps, or parsing an XML/CSV feed.

For the examples used in the following spiders, we assume you have a project with a TestItem declared in a myproject.items module:

import scrapy

class TestItem(scrapy.Item):
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()

CrawlSpider

class scrapy.spiders.CrawlSpider
This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best suited for your particular website or project, but it is generic enough for several common cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.

Apart from the attributes inherited from Spider (which you must still specify), this class supports a new attribute:

rules
A list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rule objects are described below. If multiple rules match the same link, the first one will be used, according to the order in which they are defined in this attribute.

This spider also exposes an overridable method:

parse_start_url(response)
This method is called for the start_urls responses. It allows parsing the initial responses and must return either an Item object, a Request object, or an iterable containing any of them. A short sketch follows below.
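
As an illustrative sketch only (the rule pattern and the XPath are assumptions made for this example), parse_start_url() lets a CrawlSpider extract something from the landing page itself while the rules handle the links it follows:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HomePageSpider(CrawlSpider):
    name = 'homepage'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    rules = (
        # Follow category links; no callback, so follow=True by default
        Rule(LinkExtractor(allow=(r'category\.php',))),
    )

    def parse_start_url(self, response):
        # Runs only on the start_urls responses, before the rules kick in
        yield {'homepage_title': response.xpath('//title/text()').extract_first()}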

Crawl rules

class scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)
link_extractor is a Link Extractor object which defines how links will be extracted from each crawled page.

callback is a callable or a string (in which case a method from the spider object with that name will be used) to be called for each link extracted with the specified link_extractor. This callback receives a response as its first argument and must return a list containing Item and/or Request objects (or any of their subclasses).

Warning
When writing crawl spider rules, avoid using parse as the callback, since CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

cb_kwargs is a dict containing the keyword arguments to be passed to the callback function.

follow is a boolean which specifies whether links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.

process_links is a callable or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.

process_request is a callable or a string (in which case a method from the spider object with that name will be used) which will be called with every request extracted by this rule, and must return a request or None (to filter out the request).
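
To illustrate process_links, here is a minimal sketch; the filtering criterion (dropping links whose URL contains "logout") is an invented example, not something the original text specifies:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FilteringSpider(CrawlSpider):
    name = 'filtering'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    rules = (
        # process_links names a spider method that filters the extracted links
        Rule(LinkExtractor(), follow=True, process_links='drop_logout_links'),
    )

    def drop_logout_links(self, links):
        # Receives the full list of extracted Link objects; return only the ones to keep
        return [link for link in links if 'logout' not in link.url]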

CrawlSpider example

Now let's take a look at an example CrawlSpider:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php',), deny=('subsection\.php',))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

This spider would start crawling example.com's home page, collecting category links and item links, and parsing the latter with the parse_item method. For each item response, some data will be extracted from the HTML using XPath, and an Item will be filled with it.

XMLFeedSpider

class scrapy.spiders.XMLFeedSpider
XMLFeedSpider is designed for parsing XML feeds by iterating through them by a certain node name. The iterator can be chosen from: iternodes, xml, and html. It is recommended to use the iternodes iterator for performance reasons, since the xml and html iterators generate the whole DOM at once in order to parse it. However, using html as the iterator can be useful when parsing XML with bad markup.

To set iterators and tag names, you must define the following class properties:

    • iterator
      A string which defines the iterator to use. It can be:

      • 'iternodes' - a fast iterator based on regular expressions

      • 'html' - an iterator which uses Selector. Keep in mind this uses DOM parsing and must load all the DOM in memory, which could be a problem for big feeds

      • 'xml' - an iterator which uses Selector. Keep in mind this uses DOM parsing and must load all the DOM in memory, which could be a problem for big feeds

        It defaults to: 'iternodes'.

itertag
A string that has the name of the node (or element) to iterate over. Example:
itertag = 'product'

namespaces
A list of (prefix, uri) tuples which define the namespaces available in the document that will be processed with this spider. The prefix and uri will be used to automatically register namespaces using the register_namespace() method.

You can then specify nodes with namespaces in the itertag attribute.

Example:

class YourSpider(XMLFeedSpider):

    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    # ...

Apart from these new attributes, this spider also has the following overridable methods:

adapt_response(response)
A method that receives the response as soon as it arrives from the spider middleware, before the spider starts parsing it. It can be used to modify the response body before parsing. This method receives a response and returns a response (it could be the same one or another one).

parse_node(response, selector)
This method is called for the nodes matching the provided tag name (itertag). It receives the response and a Selector for each node. Overriding this method is mandatory; otherwise, your spider won't work. This method must return either an Item object, a Request object, or an iterable containing any of them.

process_results(response, results)
This method is called for each result (item or request) returned by the spider, and it performs any final processing required before returning the results to the framework core, for example setting item IDs. It receives a list of results and the response which originated those results. It must return a list of results (items or requests). A short sketch of both hooks follows below.
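
A hedged sketch of those two hooks (the byte-order-mark stripping in adapt_response and the source tagging in process_results are invented examples of such processing, not taken from the original text):

from scrapy.spiders import XMLFeedSpider

class CleaningFeedSpider(XMLFeedSpider):
    name = 'cleaning'
    start_urls = ['http://www.example.com/feed.xml']
    itertag = 'item'

    def adapt_response(self, response):
        # Fix the body before parsing, e.g. drop a leading UTF-8 BOM
        body = response.body.lstrip(b'\xef\xbb\xbf')
        return response.replace(body=body)

    def parse_node(self, response, node):
        yield {'name': node.xpath('name/text()').extract_first()}

    def process_results(self, response, results):
        # Tag every scraped result with the feed URL before handing it back
        processed = []
        for result in results:
            if isinstance(result, dict):
                result['source'] = response.url
            processed.append(result)
        return processed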

XMLFeedSpider example

These spiders are pretty easy to use; let's have a look at an example:

from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))

        item = TestItem()
        item['id'] = node.xpath('@id').extract()
        item['name'] = node.xpath('name').extract()
        item['description'] = node.xpath('description').extract()
        return item

Basically what we do there is create a spider that downloads a feed from the given start_urls, then iterates through each of its item tags, prints them out, and stores some random data in an Item.

CSVFeedSpider

class scrapy.spiders.CSVFeedSpider
This spider is very similar to XMLFeedSpider, except that it iterates over rows instead of nodes. The method that gets called on each iteration is parse_row().

delimiter
A string with the separator character for each field in the CSV file. Defaults to ',' (comma).

quotechar
A string with the enclosure character for each field in the CSV file. Defaults to '"' (quotation mark).

headers
A list of the column names in the CSV feed file, used to extract fields from it.

parse_row(response, row)
Receives a response and a dict (representing each row) with a key for each provided (or detected) header of the CSV file. This spider also gives the opportunity to override the adapt_response and process_results methods for pre- and post-processing purposes.

CSVFeedSpider example

Let's see an example similar to the previous one, but using CSVFeedSpider:

from scrapy.spiders import CSVFeedSpider
from myproject.items import TestItem

class MySpider(CSVFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.csv']
    delimiter = ';'
    quotechar = "'"
    headers = ['id', 'name', 'description']

    def parse_row(self, response, row):
        self.logger.info('Hi, this is a row!: %r', row)

        item = TestItem()
        item['id'] = row['id']
        item['name'] = row['name']
        item['description'] = row['description']
        return item

SitemapSpider

class scrapy.spiders.SitemapSpider
SitemapSpider allows you to crawl a site by discovering its URLs using Sitemaps.

It supports nested sitemaps and discovering sitemap URLs from robots.txt.

sitemap_urls
A list of URLs pointing to the sitemaps whose URLs you want to crawl.

You can also point to a robots.txt, and it will be parsed to extract sitemap URLs from it.

sitemap_rules
A list of tuples (regex, callback) where:

    • A regex is a regular expression that matches a URL extracted from a sitemap. The regex can be either a str or a compiled regular expression object.

    • Callback is a callback used to handle URLs that match a regular expression. Callback can be a string (indicating the name of a spider method) or callable.

For example:
sitemap_rules = [('/product/', 'parse_product')]

The rules are applied sequentially, and only the first match will be used.
If this attribute is omitted, all URLs found in the site map are processed in the parse callback.

sitemap_follow
A list of regexes of sitemaps that should be followed. This only applies to sites that use a sitemap index file pointing to other sitemap files.

By default, all sitemaps are followed.

sitemap_alternate_links
Specifies whether alternate links for a URL should be followed. These are links to the same website in another language, passed within the same url block.

For example:

<url>
    <loc>http://example.com/</loc>
    <xhtml:link rel="alternate" hreflang="de" href="http://example.com/de"/>
</url>

With sitemap_alternate_links set, both URLs would be retrieved. With sitemap_alternate_links disabled, only http://example.com/ would be retrieved.

The default is for sitemap_alternate_links to be disabled.

SitemapSpider examples

The simplest example: process all URLs discovered through sitemaps using the parse callback:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        pass  # ... scrape item here ...

Process some URLs with a certain callback and other URLs with a different callback:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    sitemap_rules = [
        ('/product/', 'parse_product'),
        ('/category/', 'parse_category'),
    ]

    def parse_product(self, response):
        pass  # ... scrape product ...

    def parse_category(self, response):
        pass  # ... scrape category ...

Follow the sitemaps defined in the robots.txt file, and only follow sitemaps whose URL contains /sitemap_shop:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    sitemap_follow = ['/sitemap_shops']

    def parse_shop(self, response):
        pass  # ... scrape shop here ...

Combine SitemapSpider with other sources of URLs:

import scrapy
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    other_urls = ['http://www.example.com/about']

    def start_requests(self):
        requests = list(super(MySpider, self).start_requests())
        requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
        return requests

    def parse_shop(self, response):
        pass  # ... scrape shop here ...

    def parse_other(self, response):
        pass  # ... scrape other here ...