1. What is a crawler?
Logically, a crawler corresponds to a tree: the branches are web pages, and the leaves are the information of interest.
When we look for information starting from a URL, the content returned by that URL may contain the information we are interested in, or it may contain other URLs that in turn may lead to the information we want. A crawler corresponds to this search for information, and the search process builds up a tree.
[Figure: spider.png (http://s3.51cto.com/wyfs02/M00/58/6F/wKioL1Sv6OmBHQnEAACEFi2pjKg209.jpg)]
The scrapy.Spider class provides an interface that lets us design the entire information search process.
Parameters needed at runtime can be passed to the spider, for example the query parameters that follow the ? in a URL. Such information can optionally be passed with the -a option of the scrapy crawl command.
[Figure: param.png (http://s3.51cto.com/wyfs02/M01/58/6F/wKioL1Sv6xajm8xKAANRGRppZX4042.jpg)]
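As a hedged sketch of how this works (the spider name, parameter name, and URL below are made-up placeholders, not part of the original note): every -a key=value argument given on the command line is delivered to the spider's __init__ as a keyword argument.

import scrapy

# Run with:  scrapy crawl param_example -a category=books
class ParamExampleSpider(scrapy.Spider):
    name = 'param_example'

    def __init__(self, category=None, **kwargs):
        super(ParamExampleSpider, self).__init__(**kwargs)
        # each -a key=value arrives here as a keyword argument
        self.start_urls = ['http://www.example.com/categories/%s' % category]

    def parse(self, response):
        self.log('crawled %s' % response.url)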
2. Spider cycle
For a spider, the crawling loop looks roughly like this:
Initialize Requests with the initial URLs and set a callback function. When a Request finishes downloading, a Response is generated and passed to the callback function as a parameter.
The initial requests in the spider are obtained by calling start_requests(), which reads the URLs in start_urls and generates a Request for each of them with parse as the callback function.
Inside the callback function, parse the returned (web page) content and return Item objects, Request objects, or an iterable container of both. Returned Request objects are then handled by Scrapy, which downloads their content and invokes the callback function that was set (it can be the same function).
Within the callback function you can use selectors (or BeautifulSoup, lxml, or any parser you prefer) to parse the web page content and generate items from the extracted data.
Finally, the items returned by the spider are typically stored in a database (handled by an Item Pipeline) or written to a file using Feed exports.
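As an illustration of that last step, here is a minimal Item Pipeline sketch (the class name and output file are assumptions for the example) that appends each item to a JSON-lines file; it would be enabled through the ITEM_PIPELINES setting in settings.py.

import json

class JsonWriterPipeline(object):
    """Illustrative pipeline: write every item the spider returns as one JSON line."""

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

Feed exports can achieve something similar without any code, for example: scrapy crawl somespider -o items.json.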
Scrapy ships with several default spider implementations; the ones highlighted below are the default Spider and CrawlSpider.
3. The simplest spider (the default Spider)
Construct Request objects from the URLs in the instance attribute start_urls
The framework is responsible for executing the requests
Pass the Response returned for each Request to the parse method for analysis
The simplified source code:
class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit
    from this class.
    """

    name = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True)

    def parse(self, response):
        raise NotImplementedError


BaseSpider = create_deprecated_class('BaseSpider', Spider)
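For comparison, a minimal sketch (the spider name and URLs are placeholders) that overrides start_requests() directly instead of relying on start_urls, following the same loop described above:

import scrapy

class StartRequestsSpider(scrapy.Spider):
    name = 'start_requests_example'

    def start_requests(self):
        # build the initial Requests ourselves and choose their callback
        urls = [
            'http://www.example.com/page/1',
            'http://www.example.com/page/2',
        ]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.log('got %s' % response.url)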
An example of a callback function that returns multiple Requests:
import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        sel = scrapy.Selector(response)
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
Only two parameters are required to construct a Request object: a URL and a callback function.
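Request also takes optional arguments, for example meta, a dict that travels with the request and is available on the response inside the callback. A small sketch (the spider, the XPath, and the parse_detail callback are illustrative assumptions; response.urljoin assumes Scrapy 1.0 or later):

import scrapy

class DetailSpider(scrapy.Spider):
    name = 'detail_example'
    start_urls = ['http://www.example.com/list.html']

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(
                response.urljoin(href),            # resolve relative links (Scrapy >= 1.0)
                callback=self.parse_detail,
                meta={'from_page': response.url},  # extra data for the callback
            )

    def parse_detail(self, response):
        self.log('reached via %s' % response.meta['from_page'])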
4. CrawlSpider
Usually we need to decide in the spider which links on a page should be followed and which pages are endpoints whose links need not be followed. CrawlSpider gives us a useful abstraction, Rule, that makes this kind of crawl simple: you only have to tell Scrapy, in the rules, which links need to be followed.
Recall the spider we used to crawl the Mininova website:
class MininovaSpider(CrawlSpider):
    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/yesterday']
    rules = [Rule(LinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = response.xpath("//h1/text()").extract()
        torrent['description'] = response.xpath("//div[@id='description']").extract()
        torrent['size'] = response.xpath("//div[@id='specifications']/p[2]/text()[2]").extract()
        return torrent
The rule in the code above means: responses whose URLs match /tor/\d+ are handed to parse_torrent, and the links on those responses are not followed further (follow defaults to False once a callback is given).
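If you wanted both to parse the matched pages and to keep following the links found on them, the rule could say so explicitly with follow=True (a variant of the MininovaSpider rule above, not part of the original example):

rules = [
    # parse matched pages AND keep following links found on them
    Rule(LinkExtractor(allow=['/tor/\d+']), callback='parse_torrent', follow=True),
]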
An example is also available in the official documentation:
rules = (
    # Extract links matching 'category.php' (but not 'subsection.php') and follow them
    # (no callback means follow defaults to True)
    Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

    # Extract links matching 'item.php' and parse them with the spider's parse_item method
    Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
)
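The parse_item callback referenced by the second rule is then an ordinary spider method; a minimal sketch (the XPath expressions and field names are illustrative, not taken from the documentation):

def parse_item(self, response):
    self.log('Hi, this is an item page! %s' % response.url)
    item = {}  # plain dicts are accepted as items in Scrapy 1.0+
    item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
    item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
    return item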
In addition to Spider and CrawlSpider, there are also XMLFeedSpider, CSVFeedSpider, and SitemapSpider.
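As a small taste of one of them, a minimal SitemapSpider sketch (the sitemap URL is a placeholder; the import path is the one used in Scrapy 1.0+):

from scrapy.spiders import SitemapSpider

class MySitemapSpider(SitemapSpider):
    name = 'sitemap_example'
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        # by default every URL found in the sitemap is passed to parse()
        self.log('parsed %s' % response.url)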
Python crawler framework Scrapy learning notes 8 ---- Spider