The Scrapy shell is an interactive console that lets you try out and debug your scraping code without running the spider. It is intended for testing the code that extracts data, but you can also use it as a regular Python console to test any Python code.
The shell is used to test XPath or CSS expressions and see how they work and what data they extract from the crawled pages. While writing your spider, it lets you test your extraction code interactively, sparing you the hassle of re-running the spider after every change.
Once you're familiar with the Scrapy shell, you'll find it an invaluable tool for developing and debugging spiders.
If you have IPython installed, the Scrapy shell will use it instead of the standard Python console. The IPython console is more powerful, providing smart auto-completion, highlighted output, and other features.
We highly recommend installing IPython, especially if you work on a Unix system, where IPython works particularly well. See the IPython installation guide for details.
Start the shell
You can use the shell command to start the Scrapy shell:
scrapy shell <url>
where <url> is the address of the web page you want to crawl.
With log output (the default):
scrapy shell 'http://scrapy.org'
Without log output:
scrapy shell 'http://scrapy.org' --nolog
Using the shell
D:\Project\Small Project\scrapy_day6_httpbin\httpbin>scrapy shell "https://dig.chouti.com" --nolog
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x04E60090>
[s]   item       {}
[s]   request    <GET https://dig.chouti.com>
[s]   response   <200 https://dig.chouti.com>
[s]   settings   <scrapy.settings.Settings object at 0x04E60390>
[s]   spider     <DefaultSpider 'default' at 0x5a23f70>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()                     Shell help (print this help)
[s]   view(response)              View response in a browser
The Scrapy shell is just a regular Python console (or IPython) with some additional shortcuts.
Available shortcuts
shelp() - print a help list of the available objects and shortcuts.
fetch(request_or_url) - fetch a new response from the given request or URL and update the related objects accordingly.
view(response) - open the given response in your local browser. It adds a <base> tag to the response body so that external resources (such as CSS) are displayed correctly. Note that this creates a temporary file locally, and the file is not deleted automatically.
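As a rough sketch of what view(response) does under the hood (written from scratch here with the standard library, not Scrapy's actual implementation): save the body to a temporary .html file, inject a <base> tag right after <head> so relative links resolve against the original URL, and open the file in the default browser. The helper name open_in_browser and the launch flag are our own inventions.

```python
import re
import tempfile
import webbrowser


def open_in_browser(body: str, base_url: str, launch: bool = True) -> str:
    """Save *body* to a temp .html file with a <base> tag and return its path."""
    if "<base" not in body:
        # Insert <base> right after the opening <head> tag so that
        # relative URLs (CSS, images) still resolve against base_url.
        body = re.sub(r"(<head[^>]*>)",
                      r'\1<base href="%s">' % base_url,
                      body, count=1)
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".html",
                                      delete=False, encoding="utf-8")
    tmp.write(body)
    tmp.close()
    if launch:
        # Note: as with Scrapy's view(), the temp file is NOT deleted.
        webbrowser.open("file://" + tmp.name)
    return tmp.name


path = open_in_browser("<html><head></head><body>hi</body></html>",
                       "https://dig.chouti.com", launch=False)
```

The returned path points at the saved file, which is why the shortcut leaves a temporary file behind.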
Available Scrapy objects
Based on the downloaded page, the Scrapy shell automatically creates some convenient objects, such as the Response object and Selector objects (for HTML and XML content).
These objects are:
crawler - the current Crawler object.
spider - the Spider known to handle the URL, or a default Spider object if no spider is found for the current URL.
request - the Request object of the most recently fetched page. You can modify this request with replace(), or fetch a new request with the fetch shortcut.
response - the Response object containing the most recently fetched page.
sel - the Selector object built from the most recently fetched response.
settings - the current Scrapy settings.
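The replace() idea mentioned for the request object treats requests as immutable: instead of mutating a request in place, you derive a modified copy. The sketch below mimics that pattern with a plain frozen dataclass and dataclasses.replace; this Request class is a stand-in for illustration, not Scrapy's.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Request:
    """Minimal stand-in for a request object (not Scrapy's Request)."""
    url: str
    method: str = "GET"
    headers: dict = field(default_factory=dict)


req = Request(url="https://dig.chouti.com")

# Derive a new request with a different URL; all other fields are copied.
new_req = replace(req, url="https://dig.chouti.com/hot")
print(new_req.url, new_req.method)  # https://dig.chouti.com/hot GET
```

The original request is left untouched, which is the same contract Scrapy's replace() shortcut follows.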
Inspect the current response (its status code and URL):
>>> response
<200 https://dig.chouti.com>
>>> response.headers
{b'Date': [b'Thu, 09:45:06 GMT'], b'Content-Type': [b'text/html; charset=UTF-8'], b'Server': [b'Tengine'], b'Content-Language': [b'en'], b'X-Via': [b'1.1 bd157:10 (Cdn Cache Server V2.0)']}
Try an XPath expression to extract the content:
>>> sel.xpath('//a[@class="show-content color-chag"]/text()').extract_first()
'\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t"Thunderbolt rewards the employees who defended the company: 100,000 yuan each" On November 30, an "infighting" dispute recently broke out between Thunderbolt and Thunderbolt Big Data, with the two sides issuing several announcements accusing each other. Amid this, Thunderbolt sent an internal email commending the 5 employees who defended the company\'s interests at a critical moment, rewarding each of them with 100,000 yuan.\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t'
>>> sel.xpath('//a[@class="show-content color-chag"]/text()').extract_first().strip()
'"Thunderbolt rewards the employees who defended the company: 100,000 yuan each" On November 30, an "infighting" dispute recently broke out between Thunderbolt and Thunderbolt Big Data, with the two sides issuing several announcements accusing each other. Amid this, Thunderbolt sent an internal email commending the 5 employees who defended the company\'s interests at a critical moment, rewarding each of them with 100,000 yuan.'
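The XPath-then-.strip() workflow above can be reproduced offline with the standard library's xml.etree.ElementTree. Scrapy's selectors are built on lxml/parsel, so this is only an approximation, and the markup and class name below are made up for the sketch:

```python
import xml.etree.ElementTree as ET

# Made-up markup mimicking a list item whose text is padded with whitespace.
html = """
<div class="part1">
    <a class="show-content">
        Thunderbolt rewards employees who defended the company
    </a>
</div>
"""

root = ET.fromstring(html)

# ElementTree supports the [@attr='value'] predicate used in the article.
node = root.find(".//a[@class='show-content']")

raw = node.text      # still padded with the surrounding newlines/indentation
clean = raw.strip()  # .strip() removes the padding, as in the shell session
print(clean)         # Thunderbolt rewards employees who defended the company
```

The same two-step pattern (extract first match, then strip whitespace) is what the sel.xpath(...).extract_first().strip() chain does in the shell.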
You can also extract with CSS:
>>> sel.css('.part1 a::text').extract_first().strip()
'Netflix buys the overseas distribution rights to "White Nights"; it will be broadcast in more than 190 countries and regions'
view() is interesting: it actually saves the downloaded HTML to a local file and opens it.
>>> view(response)
True
Print the URL of the current response:
>>> response.url
'https://dig.chouti.com'