Python crawler: the Scrapy terminal (Scrapy shell)

Source: Internet
Author: User
Tags: xpath

The Scrapy terminal is an interactive shell that lets you try out and debug your crawling code without starting the spider. It is intended for testing the code that extracts data, but you can also use it as a normal Python console to run any Python code.

The terminal is used to test XPath or CSS expressions and see what data they extract from the crawled pages. When writing your spider, it lets you test your expressions interactively, eliminating the hassle of re-running the spider after every change.
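To get a feel for the kind of expression you would test in the shell, here is a minimal sketch using only the standard library's xml.etree (not Scrapy's selectors, which support full XPath); the HTML fragment and the class name are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Invented fragment standing in for a crawled page.
html = """
<div>
  <a class='show-content'>   First headline   </a>
  <a class='other'>Ignored link</a>
</div>
"""

root = ET.fromstring(html)
# ElementTree supports a small subset of XPath, including
# attribute predicates like those used with Scrapy selectors.
first = root.find(".//a[@class='show-content']")
print(first.text.strip())  # -> First headline
```

In the Scrapy shell the equivalent experiment would use response.xpath(...) against the real page, with no parsing boilerplate.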

Once you're familiar with the Scrapy terminal, you'll find that it has a huge role to play in developing and debugging spiders.

If you have IPython installed, the Scrapy terminal will use IPython instead of the standard Python console. The IPython terminal is more powerful, providing smart auto-completion, highlighted output, and other features.

We highly recommend installing IPython, especially if you are on a Unix system (IPython works well under Unix). See the IPython installation guide for details.

Start the terminal

You can use the shell command to start the Scrapy terminal:

scrapy shell <url>

where <url> is the address of the webpage you want to crawl.

Print logs:

scrapy shell 'http://scrapy.org'

Do not print logs:

scrapy shell 'http://scrapy.org' --nolog

  

Using the terminal
D:\Project\Small Project\scrapy_day6_httpbin\httpbin>scrapy shell "https://dig.chouti.com" --nolog
[s] Available scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x04e60090>
[s]   item       {}
[s]   request    <GET https://dig.chouti.com>
[s]   response   <200 https://dig.chouti.com>
[s]   settings   <scrapy.settings.Settings object at 0x04e60390>
[s]   spider     <DefaultSpider 'default' at 0x5a23f70>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

  

The Scrapy terminal is just a normal Python terminal (or IPython). It provides some additional shortcuts.

Available shortcut commands
  • shelp() - print a help list of the available objects and shortcuts
  • fetch(request_or_url) - fetch a new response from the given request or URL and update the related objects
  • view(response) - open the given response in your local browser. It adds a <base> tag to the response body so that external resources (such as CSS) display correctly. Note that this creates a temporary file locally, and the file is not deleted automatically.
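As a rough illustration of the <base>-tag trick that view() relies on, here is a simplified sketch (not Scrapy's actual implementation; it assumes the page has a lowercase <head> tag):

```python
def add_base_tag(html: str, base_url: str) -> str:
    """Insert a <base> tag right after <head> so that relative links
    in the saved page resolve against the original URL.
    Simplified sketch: assumes a literal lowercase '<head>' tag."""
    return html.replace('<head>', f'<head><base href="{base_url}">', 1)

page = '<html><head><title>t</title></head><body></body></html>'
result = add_base_tag(page, 'https://dig.chouti.com/')
print(result)
```

With the <base> tag in place, a stylesheet referenced as /style.css loads from the original site even though the page itself is opened from a local temporary file.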
Available Scrapy objects

The Scrapy terminal automatically creates some easy-to-use objects, such as Response objects and Selector objects (for HTML and XML content), based on the downloaded page.

These objects are:

  • crawler-The current Crawler object.
  • spider - the Spider that handles the current URL, or a DefaultSpider object if no spider is defined for it.
  • request - the Request object for the most recently fetched page. You can modify it with replace(), or use the fetch shortcut to issue a new request.
  • response - the Response object containing the most recently fetched page.
  • sel - a Selector object built from the most recently fetched response.
  • settings-Current Scrapy settings

Print the current response; its status code is shown:

>>> response
<200 https://dig.chouti.com>

>>> response.headers
{b'Date': [b'Thu, 09:45:06 GMT'], b'Content-Type': [b'text/html; charset=utf-8'], b'Server': [b'Tengine'], b'Content-Language': [b'en'], b'X-Via': [b'1.1 bd157:10 (Cdn Cache Server V2.0)']}
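Note that response.headers stores both keys and values as bytes. A quick sketch of decoding such a mapping into plain strings for easier inspection (the dict below is a stand-in for the real Scrapy headers object):

```python
# Stand-in for the bytes-keyed header mapping shown above.
raw_headers = {
    b'Content-Type': [b'text/html; charset=utf-8'],
    b'Server': [b'Tengine'],
}

# Decode keys and values so they are easier to read and compare.
decoded = {
    key.decode('latin-1'): [value.decode('latin-1') for value in values]
    for key, values in raw_headers.items()
}
print(decoded['Server'])  # -> ['Tengine']
```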

Try an XPath expression to extract content:

>>> sel.xpath('//a[@class="show-content color-chag"]/text()').extract_first()
'\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t"Xunlei rewards employees who defended the company: 100,000 yuan each" On November 30, ...\n\t\t\t\t\t\t\t\t\t\t'
>>> sel.xpath('//a[@class="show-content color-chag"]/text()').extract_first().strip()
'"Xunlei rewards employees who defended the company: 100,000 yuan each" On November 30, an "infighting" broke out between Xunlei and Xunlei Big Data, and the two sides issued several announcements accusing each other. Xunlei then sent an internal email commending the 5 employees who defended the company's interests at the critical moment, rewarding each of them 100,000 yuan.'
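The runs of \n and \t in the first result are why the example calls .strip() on the extracted text. A pure-Python sketch of that cleanup (the raw string is invented):

```python
# Invented example of the whitespace-padded text a selector can return.
raw = "\n\t\t\t\t\tExample headline from the crawled page\n\t\t\t"

# strip() removes the leading and trailing newline/tab padding.
clean = raw.strip()
print(clean)  # -> Example headline from the crawled page
```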

  

You can also extract with CSS:

>>> sel.css('.part1 a::text').extract_first().strip()
'Netflix buys the overseas distribution rights of "White Nights"; it will be broadcast in more than 190 countries and regions'

  

The view shortcut is interesting: it simply saves the downloaded HTML locally and opens it.

>>> view(response)
True

Print the URL of the current request

>>> response.url
'https://dig.chouti.com'
