What is the Scrapy shell?
The Scrapy shell is an interactive console that lets us try out and debug code without starting a spider, and test XPath or CSS expressions to see how they work and how easily they extract the data in a page.
Selectors (built into Scrapy)
Selector has four basic methods, the most commonly used of which is xpath():
- xpath(): takes an XPath expression and returns a list of selectors for all nodes matching the expression
- extract(): serializes the matched nodes to Unicode strings and returns them as a list
- css(): takes a CSS expression and returns a list of selectors for all nodes matching the expression, using the same CSS selector syntax as BeautifulSoup4
- re(): extracts data using the given regular expression and returns a list of Unicode strings
What does a spider do? What is a spider?
The Spider class defines how to crawl a particular website (or group of websites). This includes the crawl actions (for example, whether to follow links) and how to extract structured data (the crawled items) from its pages. In other words, a spider is where the crawl actions are defined and web pages are parsed.
Properties and methods of a spider
Main properties and methods:
name: a string defining the name of the spider. For example, if the spider crawls website.com, it is typically named website.
allowed_domains: an optional list of the domains the spider is allowed to crawl.
start_urls: a tuple or list of initial URLs. When no specific URL is given, the spider starts crawling from this list.
start_requests(self): returns an iterable containing the first requests the spider will crawl (the default implementation uses the URLs in start_urls).
parse(self, response): the default callback for a Request whose URL returns a page and which does not specify its own callback. It handles the response returned for the page and yields the extracted items or further request objects.
How do I write a spider to crawl data?
See:
Scrapy knowledge supplement: scrapy shell and Spider