Scrapy crawler tutorial 4: Spiders

Scrapy crawler tutorial 1 Installation and basic use
Scrapy crawler tutorial 2 Official Demo
Scrapy crawler tutorial 3 Command line tools: introduction and examples
Scrapy crawler tutorial 4 Spiders
Scrapy crawler tutorial 5 Selectors
Scrapy crawler tutorial 6 Items
Scrapy crawler tutorial 7 Item Loaders
Scrapy crawler tutorial 8 Interactive shell debugging
Scrapy crawler tutorial 9 Item Pipeline
Scrapy crawler tutorial 10 Feed exports
Scrapy crawler tutorial 11 Request and Response
Scrapy crawler tutorial 12 Link Extractors

[Toc]

Development environment:
Python 3.6.0 (latest at the time of writing)
Scrapy 1.3.2 (latest at the time of writing)

Spider

A spider is a class that defines how a certain site (or a group of sites) will be scraped, including how to follow links and how to extract structured data (i.e. items) from its pages. In other words, a Spider is where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).

For spiders, the scraping cycle goes through something like this:

  1. You start by generating the initial Requests to crawl the first URLs, and you specify a callback function to be called with the response downloaded from those requests.

    The first requests to perform are obtained by calling start_requests(), which (by default) generates a Request for each URL in start_urls, with the parse method as the callback function for those Requests.

  2. In the callback function you parse the response (web page) and return an iterable of extracted data, Item objects, Request objects, or a mix of them. Those Requests will also contain a callback (maybe the same one); they are then downloaded by Scrapy and their responses handled by the specified callback.

  3. In the callback functions you typically parse the page content using Selectors (but you can also use BeautifulSoup, lxml, or whatever mechanism you prefer) and generate items with the parsed data.

  4. Finally, the items returned from the spider are usually persisted to a database (in some Item Pipeline) or written to a file using Feed exports. A minimal sketch tying these steps to code follows this list.
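
As a rough illustration, here is a minimal sketch that maps the four steps above onto code (the site, selectors, and class name are illustrative examples, not part of the original text):

import scrapy

class QuotesSketchSpider(scrapy.Spider):
    # Step 1: the default start_requests() builds the initial Requests
    # from start_urls, with parse() as their callback.
    name = 'quotes_sketch'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Steps 2 and 3: parse the response with selectors and yield extracted data...
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').extract_first()}
        # ...and yield further Requests, each with its own callback.
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
        # Step 4: the yielded items are then persisted by item pipelines
        # or written out via Feed exports.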

Even though this cycle applies (more or less) to any kind of spider, there are different kinds of default spiders bundled into Scrapy for different purposes. We will talk about those types here.

class scrapy.spiders.Spider

This is the simplest spider, and the one every other spider must inherit from (including the spiders that come bundled with Scrapy, as well as spiders you write yourself). It doesn't provide any special functionality. It simply provides a default start_requests() implementation which sends requests from the start_urls spider attribute and calls the spider's parse method for each of the resulting responses.

name
A string that defines the name of this spider. The spider name is how the spider is located (and instantiated) by Scrapy, so it must be unique. However, nothing prevents you from instantiating more than one instance of the same spider. This is the most important spider attribute and it is required.

If the spider scrapes a single domain, a common practice is to name the spider after the domain. So, for example, a spider that crawls mywebsite.com would often be called mywebsite.

Note:
In Python 2, this must be ASCII.

allowed_domains
An optional list of strings containing the domains that this spider is allowed to crawl. Requests for URLs not belonging to the domains in this list will not be followed.

start_urls
A list of URLs where the spider will begin to crawl from, when no particular URLs are specified.

custom_settings
A dictionary of settings that will override the project-wide configuration when running this spider. It must be defined as a class attribute, since the settings are updated before instantiation.

For a list of available built-in settings, see built-in settings reference.
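
For instance, a spider could lower its own request rate without touching the project settings; a minimal sketch (the spider name and values are illustrative):

import scrapy

class ThrottledSpider(scrapy.Spider):
    name = 'throttled_sketch'
    start_urls = ['http://www.example.com']

    # Must be a class attribute: it overrides the project-wide settings
    # for this spider only.
    custom_settings = {
        'DOWNLOAD_DELAY': 2.0,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
    }

    def parse(self, response):
        pass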

crawler
This attribute is set by the from_crawler() class method after the class is initialized, and links to the Crawler object this spider instance is bound to.

Crawlers encapsulate a lot of components in the project for single-entry access (such as extensions, middlewares, and signal managers). See the Crawler API for more information.

settings
The configuration for running this spider. This is a Settings instance; see the Settings topic for a detailed introduction.

logger
A Python logger created with the spider's name. You can use it to send log messages, as described in Logging from Spiders.

from_crawler(crawler, *args, **kwargs)
This is the class method used by Scrapy to create your spiders.

You probably won't need to override this directly, because the default implementation acts as a proxy to the __init__() method, calling it with the given args and named kwargs.

Nonetheless, this method sets the crawler and settings attributes on the new instance so they can be accessed later inside the spider (see the sketch after the parameter list).

  • Parameters:

    • crawler (Crawler instance) - the crawler the spider will be bound to

    • args (list) - arguments passed to the __init__() method

    • kwargs (dict) - keyword arguments passed to the __init__() method
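
A common reason to override from_crawler() anyway is to reach the crawler object, for example to connect a signal; a sketch along the lines of the usual pattern (the spider name and handler are illustrative):

import scrapy
from scrapy import signals

class SignalSpider(scrapy.Spider):
    name = 'signal_sketch'
    start_urls = ['http://www.example.com']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Let the default implementation build the spider and set the
        # crawler and settings attributes...
        spider = super(SignalSpider, cls).from_crawler(crawler, *args, **kwargs)
        # ...then use the crawler object, e.g. to connect a signal handler.
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        self.logger.info('Spider closed: %s', spider.name)

    def parse(self, response):
        pass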

start_requests()
This method must return an iterable with the first Requests to crawl for this spider.

If you define start_requests(), you do not need to define start_urls.

The default implementation generates requests from the URLs in start_urls, but start_requests() can be overridden.
For example, if you need to start by logging in using a POST request, you could do:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass

make_requests_from_url(url)
A method that receives a URL and returns a Request object (or a list of Request objects) to scrape. This method is used to construct the initial requests in the start_requests() method, and is typically used to convert URLs to requests.

Unless overridden, this method returns Requests with the parse() method as their callback function and with the dont_filter parameter enabled (see the Request class for more info).
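
As a hedged sketch, overriding it lets you attach extra data to every initial request while keeping parse() as the default callback (the meta key and spider name are illustrative assumptions):

import scrapy

class MetaSpider(scrapy.Spider):
    name = 'meta_sketch'
    start_urls = ['http://www.example.com/1.html']

    def make_requests_from_url(self, url):
        # Requests built here get no explicit callback, so Scrapy falls
        # back to the spider's parse() method.
        return scrapy.Request(url, dont_filter=True, meta={'source': 'start_urls'})

    def parse(self, response):
        self.logger.info('Got %s (source: %s)', response.url, response.meta.get('source'))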

parse(response)
This is the default callback used by Scrapy to process downloaded responses, when their requests don't specify a callback.

The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. Other Request callbacks have the same requirements as the Spider class itself.

This method, as well as any other Request callback, must return an iterable of Request, dicts, and/or Item objects.

  • Parameters:

    • response (Response) - the response to parse

log(message[, level, component])
A wrapper that sends a log message through the spider's logger, kept for backward compatibility. For more information see Logging from Spiders.

closed(reason)
Called when the spider closes. This method provides a shortcut to signals.connect() for the spider_closed signal.
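
A brief sketch of closed() used for simple teardown logging (the page counter and spider name are illustrative):

import scrapy

class CountingSpider(scrapy.Spider):
    name = 'counting_sketch'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # Count the pages seen so the final tally can be logged on close.
        self.pages_seen = getattr(self, 'pages_seen', 0) + 1
        yield {'url': response.url}

    def closed(self, reason):
        # Called automatically when the spider finishes; 'reason' is a string
        # such as 'finished' or 'closespider_timeout'.
        self.logger.info('Spider closed (%s) after %d pages',
                         reason, getattr(self, 'pages_seen', 0))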

Let's take an example:

import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)

Returning multiple Requests and items from a single callback:

import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

You can use start_requests() directly instead of start_urls; and to give the scraped data more structure, you can use Items:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
Spider arguments

Spiders can receive arguments that modify their behaviour. Some common uses of spider arguments are to define the start URLs or to restrict the crawl to certain sections of the site, but they can be used to configure any functionality of the spider.

Spider arguments are passed through the crawl command using the -a option. For example:

scrapy crawl myspider -a category=electronics

Spiders can access arguments in their __init__ method:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        # ...

The default __init__ method will take any spider arguments and copy them to the spider as attributes. The above example can also be written as follows:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/categories/%s' % self.category)

Keep in mind that spider arguments are only strings; the spider will not do any parsing on its own. If you were to set the start_urls attribute from the command line, you would have to parse it into a list yourself, using something like ast.literal_eval or json.loads, and then set it as an attribute. Otherwise, you would end up iterating over a start_urls string (a very common Python pitfall) and each character would be treated as a separate URL.
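
For example, a spider could accept the URL list as a JSON-encoded string and decode it itself; a minimal sketch (the spider name and argument format are illustrative assumptions):

import json

import scrapy

class ArgSpider(scrapy.Spider):
    name = 'arg_sketch'

    def __init__(self, start_urls=None, *args, **kwargs):
        super(ArgSpider, self).__init__(*args, **kwargs)
        if start_urls:
            # The argument arrives as a plain string, so decode it into a
            # real list before assigning it to start_urls.
            self.start_urls = json.loads(start_urls)

This could then be run as, say: scrapy crawl arg_sketch -a start_urls='["http://www.example.com/1.html", "http://www.example.com/2.html"]'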

A valid use case is to set the HTTP auth credentials used by HttpAuthMiddleware or the user agent used by UserAgentMiddleware:
scrapy crawl myspider -a http_user=myuser -a http_pass=mypassword -a user_agent=mybot

Spider arguments can also be passed through the Scrapyd schedule.json API. See the Scrapyd documentation.

Generic spiders

Scrapy comes with some useful generic spiders that you can subclass for your own spiders. They aim to provide convenient functionality for a few common scraping cases, like following all links on a site based on certain rules, crawling from Sitemaps, or parsing an XML/CSV feed.

For the examples used in the following spiders, we'll assume you have a project with a TestItem declared in the myproject.items module:

import scrapy

class TestItem(scrapy.Item):
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
CrawlSpider

class scrapy.spiders.CrawlSpider
This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best fit for your particular website or project, but it is generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.

In addition to attributes inherited from Spider (which must be specified), this class supports a new attribute:

rules
A list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rule objects are described below. If multiple rules match the same link, the first one will be used, according to the order in which they are defined in this attribute.

This spider also exposes an overridable method:

parse_start_url(response)
This method is called for the start_urls responses. It allows parsing the initial responses and must return either an Item object, a Request object, or an iterable containing any of them.
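
A short, illustrative sketch of parse_start_url in a CrawlSpider (the selectors and names are assumptions for the example):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HomePageSpider(CrawlSpider):
    name = 'homepage_sketch'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']
    rules = (
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_start_url(self, response):
        # Runs only for the responses of start_urls, before the rules are
        # applied to the links found on those pages.
        yield {'homepage_title': response.xpath('//title/text()').extract_first()}

    def parse_item(self, response):
        yield {'url': response.url}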

Crawling rules

class scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)
link_extractor is a Link Extractor object which defines how links will be extracted from each crawled page.

callback is a callable or a string (in which case a method from the spider object with that name will be used) to be called for each link extracted with the specified link_extractor. This callback receives a response as its first argument and must return a list containing Item and/or Request objects (or any subclass of them).

Warning
When writing crawl spider rules, avoid using parse as the callback, since CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

cb_kwargs is a dict containing the keyword arguments to be passed to the callback function.

follow is a boolean which specifies whether links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.

process_links is a callable or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.

process_request is a callable or a string (in which case a method from the spider object with that name will be used) which will be called with every request extracted by this rule, and must return a request or None (to filter out the request).
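
A hedged sketch showing process_links and process_request in use (the filtering and tagging logic are illustrative):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FilteringSpider(CrawlSpider):
    name = 'filtering_sketch'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(allow=(r'item\.php', )),
             callback='parse_item',
             process_links='drop_session_links',
             process_request='tag_request'),
    )

    def drop_session_links(self, links):
        # process_links: filter (or modify) the extracted Link objects.
        return [link for link in links if 'sessionid' not in link.url]

    def tag_request(self, request):
        # process_request: return the (possibly modified) request,
        # or None to drop it.
        request.meta['from_rule'] = True
        return request

    def parse_item(self, response):
        yield {'url': response.url}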

CrawlSpider example

Now let's take a look at an example CrawlSpider with rules:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from myproject.items import TestItem

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)

        item = TestItem()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

This spider would start crawling example.com's home page, collecting category links and item links, and parsing the latter with the parse_item method. For each item response, some data is extracted from the HTML using XPath, and an Item is filled with it.

XMLFeedSpider

class scrapy.spiders.XMLFeedSpider
XMLFeedSpider is designed for parsing XML feeds by iterating through them by a certain node name. The iterator can be chosen from: iternodes, xml, and html. It is recommended to use the iternodes iterator for performance reasons, since the xml and html iterators generate the whole DOM at once in order to parse it. However, using html as the iterator may be useful when parsing XML with bad markup.

To set the iterator and tag name, you must define the following class attributes:

  • iterator
    A string which defines the iterator to use. It can be:

    • 'iternodes' - a fast iterator based on regular expressions

    • 'html' - an iterator which uses Selector. Keep in mind this uses DOM parsing and must load all the DOM in memory, which could be a problem for big feeds

    • 'xml' - an iterator which uses Selector. Keep in mind this uses DOM parsing and must load all the DOM in memory, which could be a problem for big feeds

    It defaults to: 'iternodes'.

itertag
A string with the name of the node (or element) to be iterated. Example:
itertag = 'product'

namespaces
A list of (prefix, uri) tuples which define the namespaces available in the document that will be processed with this spider. The prefix and uri will be used to automatically register namespaces using the register_namespace() method.

You can then specify nodes with namespaces in the itertag attribute.

Example:

class YourSpider(XMLFeedSpider):

    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    # ...

In addition to these new attributes, this spider also has the following overridable methods:

adapt_response(response)
A method that receives the response as soon as it arrives from the spider middleware, before the spider starts parsing it. It can be used to modify the response body before parsing. This method receives a response and also returns a response (it could be the same one or another one).
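
A minimal sketch of adapt_response, here stripping a UTF-8 byte-order mark before the feed is parsed (the cleanup itself, spider name, and feed URL are illustrative assumptions):

from scrapy.spiders import XMLFeedSpider

class CleaningFeedSpider(XMLFeedSpider):
    name = 'cleaning_feed_sketch'
    start_urls = ['http://www.example.com/feed.xml']
    itertag = 'item'

    def adapt_response(self, response):
        # Modify the raw body before the feed is parsed; return either the
        # original response or a new one built with replace().
        body = response.body.replace(b'\xef\xbb\xbf', b'')
        return response.replace(body=body)

    def parse_node(self, response, node):
        yield {'name': node.xpath('name/text()').extract_first()}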

parse_node(response, selector)
This method is called for the nodes matching the provided tag name (itertag). It receives the response and a Selector for each node. Overriding this method is mandatory; otherwise your spider won't work. This method must return either an Item object, a Request object, or an iterable containing any of them.

process_results(response, results)
This method is called for each result (Item or Request) returned by the spider, and it is intended to perform any last-time processing required before returning the results to the framework core, for example setting item IDs. It receives a list of results and the response which originated those results. It must return a list of results (Items or Requests).
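
And a short sketch of process_results, which post-processes everything parse_node produced for a response (the added field and names are illustrative assumptions):

from scrapy.spiders import XMLFeedSpider

class TaggingFeedSpider(XMLFeedSpider):
    name = 'tagging_feed_sketch'
    start_urls = ['http://www.example.com/feed.xml']
    itertag = 'item'

    def parse_node(self, response, node):
        yield {'name': node.xpath('name/text()').extract_first()}

    def process_results(self, response, results):
        # Receives every item/request returned for this response and must
        # itself return (or yield) an iterable of results.
        for result in results:
            if isinstance(result, dict):
                result['source_url'] = response.url
            yield result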

XMLFeedSpider example

These spiders are pretty easy to use. Let's have a look at an example:

from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))

        item = TestItem()
        item['id'] = node.xpath('@id').extract()
        item['name'] = node.xpath('name').extract()
        item['description'] = node.xpath('description').extract()
        return item

Basically, what we did there was create a spider that downloads a feed from the given start_urls, iterates through each of its item tags, prints them out, and stores some data in an Item.

CSVFeedSpider

class scrapy.spiders.CSVFeedSpider
This spider is very similar to XMLFeedSpider, except that it iterates over rows instead of nodes. The method that gets called in each iteration is parse_row().

delimiter
A string with the separator character for each field in the CSV file. Defaults to ',' (comma).

quotechar
A string with the enclosure character for each field in the CSV file. Defaults to '"' (quotation mark).

headers
A list of the column names in the CSV file, used to extract fields from it.

parse_row(response, row)
Receives a response and a dict (representing each row) with a key for each provided (or detected) header of the CSV file. This spider also gives the opportunity to override the adapt_response and process_results methods for pre- and post-processing purposes.

CSVFeedSpider example

Let's look at an example similar to the previous one, but using a CSVFeedSpider:

from scrapy.spiders import CSVFeedSpider
from myproject.items import TestItem

class MySpider(CSVFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.csv']
    delimiter = ';'
    quotechar = "'"
    headers = ['id', 'name', 'description']

    def parse_row(self, response, row):
        self.logger.info('Hi, this is a row!: %r', row)

        item = TestItem()
        item['id'] = row['id']
        item['name'] = row['name']
        item['description'] = row['description']
        return item
SitemapSpider

class scrapy.spiders.SitemapSpider
SitemapSpider allows you to crawl a site by discovering its URLs using Sitemaps.

It supports nested sitemaps and discovering sitemap URLs from robots.txt.

sitemap_urls
A list of URLs pointing to the sitemaps whose URLs you want to crawl.

You can also point to a robots.txt file; it will be parsed to extract sitemap URLs from it.

sitemap_rules
A list of tuples (regex, callback) where:

  • Regex is a regular expression that matches the URLs extracted from Sitemap. Regex can be a str or a compiled regular expression object.

  • Callback is used to process the callback of a url that matches the regular expression. Callback can be a string (indicating the name of the spider method) or callable.

For example:
sitemap_rules = [('/product/', 'parse_product')]

Rules are applied in order; only the first one that matches will be used.
If you omit this attribute, all URLs found in sitemaps will be processed with the parse callback.

sitemap_follow
A list of regexes of sitemaps to follow. This is only for sites that use Sitemap index files pointing to other sitemap files.

By default, all sitemaps are followed.

sitemap_alternate_links
Specifies whether alternate links for one URL should be followed. These are links for the same website in another language passed within the same url block.

For example:

<url>
    <loc>http://example.com/</loc>
    <xhtml:link rel="alternate" hreflang="de" href="http://example.com/de"/>
</url>

With sitemap_alternate_links set, this would retrieve both URLs. With sitemap_alternate_links disabled, only http://example.com/ would be retrieved.

Default is sitemap_alternate_links disabled.

SitemapSpider example

The simplest example is to process all URLs discovered through sitemaps using the parse callback:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        pass # ... scrape item here ...

Process some URLs with a certain callback and other URLs with a different callback:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    sitemap_rules = [
        ('/product/', 'parse_product'),
        ('/category/', 'parse_category'),
    ]

    def parse_product(self, response):
        pass # ... scrape product ...

    def parse_category(self, response):
        pass # ... scrape category ...

Follow the sitemaps defined in the robots.txt file, but only follow sitemaps whose URL contains /sitemap_shop:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    sitemap_follow = ['/sitemap_shops']

    def parse_shop(self, response):
        pass # ... scrape shop here ...

Combine SitemapSpider with other sources of URLs:

import scrapy
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]

    other_urls = ['http://www.example.com/about']

    def start_requests(self):
        requests = list(super(MySpider, self).start_requests())
        requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
        return requests

    def parse_shop(self, response):
        pass # ... scrape shop here ...

    def parse_other(self, response):
        pass # ... scrape other here ...

That covers the details of Scrapy crawler tutorial 4: Spiders. For more information, see the other articles in this series.
