The Scrapy shell is an interactive console that lets you try out and debug your scraping code without running the spider. It is intended for testing the code that extracts data, but you can also use it as a regular Python console to test any Python code.
The shell is used to test XPath or CSS expressions and see how they work and what data they extract from the crawled pages. While writing your spider, it lets you test your extraction code interactively, sparing you the hassle of re-running the spider after every change.
Once you're familiar with the Scrapy shell, you'll find it an invaluable tool for developing and debugging spiders.
If you have IPython installed, the Scrapy shell will use it instead of the standard Python console. The IPython console is more powerful, providing smart auto-completion, highlighted output, and other features.
We highly recommend installing IPython, especially if you work on a Unix system, where IPython works particularly well. See the IPython installation guide for details.
Start the shell
You can use the shell command to start the Scrapy shell:
scrapy shell <url>
where <url> is the address of the web page you want to crawl.
With log output (the default):
scrapy shell 'http://scrapy.org'
Without log output:
scrapy shell 'http://scrapy.org' --nolog
Using the shell
D:\Project\Small Project\scrapy_day6_httpbin\httpbin>scrapy shell "https://dig.chouti.com" --nolog
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x04E60090>
[s]   item       {}
[s]   request    <GET https://dig.chouti.com>
[s]   response   <200 https://dig.chouti.com>
[s]   settings   <scrapy.settings.Settings object at 0x04E60390>
[s]   spider     <DefaultSpider 'default' at 0x5a23f70>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()                     Shell help (print this help)
[s]   view(response)              View response in a browser
The Scrapy shell is just a regular Python console (or IPython) with some additional shortcuts.
Available shortcuts
shelp() - print a help list of the available objects and shortcuts.
fetch(request_or_url) - fetch a new response from the given request or URL and update the related objects accordingly.
view(response) - open the given response in your local browser. It adds a <base> tag to the response body so that external resources (such as CSS) are displayed correctly. Note that this creates a temporary file locally, and the file is not deleted automatically.
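As a rough sketch of what view(response) does under the hood (written from scratch here with the standard library, not Scrapy's actual implementation): save the body to a temporary .html file, inject a <base> tag right after <head> so relative links resolve against the original URL, and open the file in the default browser. The helper name open_in_browser and the launch flag are our own inventions.

```python
import re
import tempfile
import webbrowser


def open_in_browser(body: str, base_url: str, launch: bool = True) -> str:
    """Save *body* to a temp .html file with a <base> tag and return its path."""
    if "<base" not in body:
        # Insert <base> right after the opening <head> tag so that
        # relative URLs (CSS, images) still resolve against base_url.
        body = re.sub(r"(<head[^>]*>)",
                      r'\1<base href="%s">' % base_url,
                      body, count=1)
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".html",
                                      delete=False, encoding="utf-8")
    tmp.write(body)
    tmp.close()
    if launch:
        # Note: as with Scrapy's view(), the temp file is NOT deleted.
        webbrowser.open("file://" + tmp.name)
    return tmp.name


path = open_in_browser("<html><head></head><body>hi</body></html>",
                       "https://dig.chouti.com", launch=False)
```

The returned path points at the saved file, which is why the shortcut leaves a temporary file behind.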
Available Scrapy objects
Based on the downloaded page, the Scrapy shell automatically creates some convenient objects, such as the Response object and Selector objects (for HTML and XML content).
These objects are:
crawler - the current Crawler object.
spider - the Spider known to handle the URL, or a default Spider object if no spider is found for the current URL.
request - the Request object of the most recently fetched page. You can modify this request with replace(), or fetch a new request with the fetch shortcut.
response - the Response object containing the most recently fetched page.
sel - the Selector object built from the most recently fetched response.
settings - the current Scrapy settings.
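The replace() idea mentioned for the request object treats requests as immutable: instead of mutating a request in place, you derive a modified copy. The sketch below mimics that pattern with a plain frozen dataclass and dataclasses.replace; this Request class is a stand-in for illustration, not Scrapy's.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Request:
    """Minimal stand-in for a request object (not Scrapy's Request)."""
    url: str
    method: str = "GET"
    headers: dict = field(default_factory=dict)


req = Request(url="https://dig.chouti.com")

# Derive a new request with a different URL; all other fields are copied.
new_req = replace(req, url="https://dig.chouti.com/hot")
print(new_req.url, new_req.method)  # https://dig.chouti.com/hot GET
```

The original request is left untouched, which is the same contract Scrapy's replace() shortcut follows.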
Inspect the current response (its status code and URL):
>>> response
<200 https://dig.chouti.com>
>>> response.headers
{b'Date': [b'Thu, 09:45:06 GMT'], b'Content-Type': [b'text/html; charset=UTF-8'], b'Server': [b'Tengine'], b'Content-Language': [b'en'], b'X-Via': [b'1.1 bd157:10 (Cdn Cache Server V2.0)']}
Try an XPath expression to extract the content:
>>> sel.xpath('//a[@class="show-content color-chag"]/text()').extract_first()
'\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t"Thunderbolt rewards the employees who defended the company: 100,000 yuan each" On November 30, an "infighting" dispute recently broke out between Thunderbolt and Thunderbolt Big Data, with the two sides issuing several announcements accusing each other. Amid this, Thunderbolt sent an internal email commending the 5 employees who defended the company\'s interests at a critical moment, rewarding each of them with 100,000 yuan.\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t'
>>> sel.xpath('//a[@class="show-content color-chag"]/text()').extract_first().strip()
'"Thunderbolt rewards the employees who defended the company: 100,000 yuan each" On November 30, an "infighting" dispute recently broke out between Thunderbolt and Thunderbolt Big Data, with the two sides issuing several announcements accusing each other. Amid this, Thunderbolt sent an internal email commending the 5 employees who defended the company\'s interests at a critical moment, rewarding each of them with 100,000 yuan.'
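The XPath-then-.strip() workflow above can be reproduced offline with the standard library's xml.etree.ElementTree. Scrapy's selectors are built on lxml/parsel, so this is only an approximation, and the markup and class name below are made up for the sketch:

```python
import xml.etree.ElementTree as ET

# Made-up markup mimicking a list item whose text is padded with whitespace.
html = """
<div class="part1">
    <a class="show-content">
        Thunderbolt rewards employees who defended the company
    </a>
</div>
"""

root = ET.fromstring(html)

# ElementTree supports the [@attr='value'] predicate used in the article.
node = root.find(".//a[@class='show-content']")

raw = node.text      # still padded with the surrounding newlines/indentation
clean = raw.strip()  # .strip() removes the padding, as in the shell session
print(clean)         # Thunderbolt rewards employees who defended the company
```

The same two-step pattern (extract first match, then strip whitespace) is what the sel.xpath(...).extract_first().strip() chain does in the shell.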
You can also extract with CSS:
>>> sel.css('.part1 a::text').extract_first().strip()
'Netflix buys the overseas distribution rights to "White Nights"; it will be broadcast in more than 190 countries and regions'
view() is interesting: it actually saves the downloaded HTML to a local file and opens it.
>>> view(response)
True
Print the URL of the current response:
>>> response.url
'https://dig.chouti.com'