Scrapy is controlled through the scrapy command-line tool, which provides a number of commands for different purposes, each with its own arguments and options.
Some scrapy commands must be executed inside a Scrapy project directory, while others can be run anywhere. Commands that can run anywhere may behave slightly differently when executed inside a Scrapy project directory.
Scrapy Command Execution Environment

| Global commands | Project-only commands |
| --------------- | --------------------- |
| startproject    | crawl                 |
| genspider       | check                 |
| settings        | list                  |
| runspider       | edit                  |
| shell           | parse                 |
| fetch           | bench                 |
| view            |                       |
| version         |                       |
1. Scrapy
First, run the scrapy command-line tool without any command; it prints usage information and the available commands to the screen:
(scrapyenv) macbook-pro:~ $ scrapy
Scrapy 1.4.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
(scrapyenv) macbook-pro:~ $
If the command is run inside a Scrapy project directory, the first line shows the name of the current project; outside of any project it shows "no active project".
To learn more about a specific command, use scrapy <command> -h:
(scrapyenv) macbook-pro:myproject $ scrapy startproject -h
Usage
=====
  scrapy startproject <project_name> [project_dir]

Create new project

Options
=======
--help, -h              show this help message and exit

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure
(scrapyenv) macbook-pro:myproject $
2. Scrapy startproject
Syntax: scrapy startproject <project_name> [project_dir]
Requires project: no
Create a Scrapy project.
(scrapyenv) macbook-pro:project $ scrapy startproject myproject [project_dir]
A new Scrapy project is created under the project_dir directory. If project_dir is not specified, it defaults to myproject.
Then change into the new project's directory; from there, the scrapy command can be used to manage and control the new project.
Two points are worth expanding on here:
2.1 Configuration Settings
The Scrapy configuration is stored in scrapy.cfg files, which can live in three places:
1. /etc/scrapy.cfg or c:\scrapy\scrapy.cfg (system-level configuration)
2. ~/.config/scrapy.cfg ($XDG_CONFIG_HOME) and ~/.scrapy.cfg ($HOME) (user-level configuration)
3. scrapy.cfg in the root directory of the Scrapy project (project-level configuration)
These configurations are merged together with precedence 3 > 2 > 1: project-level settings override user-level settings, which in turn override system-level settings.
Scrapy can also be configured through several environment variables:
SCRAPY_SETTINGS_MODULE
SCRAPY_PROJECT
SCRAPY_PYTHON_SHELL
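For instance, SCRAPY_SETTINGS_MODULE tells Scrapy which settings module to load. A minimal sketch of using it from plain Python (the module name myproject.settings is an assumption, and it must be importable, e.g. run the script from the project root):

```python
import os

# Assumption: myproject.settings is importable from the current directory.
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.settings'

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print(settings.get('BOT_NAME'))  # prints the project's bot name
```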
2.2 Project Structure
The default structure of a Scrapy project looks like this:
.
|____myproject
| |____items.py
| |____middlewares.py
| |____pipelines.py
| |____settings.py
| |____spiders
| | |____spider1.py
| | |____spider2.py
|____scrapy.cfg
The directory containing scrapy.cfg is the root directory of the project. The file holds the name of the Python module that defines the project settings, for example:
[settings]
default = myproject.settings
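The modules in the project package are where the actual scraping logic lives. As an illustration (the field names are assumptions, not part of the generated template), myproject/items.py might define an item like this:

```python
# myproject/items.py -- an illustrative Item definition
import scrapy

class QuoteItem(scrapy.Item):
    # Each Field() declares an attribute that spiders can populate.
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
```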
3. Scrapy genspider
Syntax: scrapy genspider [-t template] <name> <domain>
Requires project: no
Creates a new spider in the current directory, or in the spiders directory of the current project. The <name> argument sets the spider's name, and <domain> is used to generate the spider's allowed_domains and start_urls attributes.
(scrapyenv) macbook-pro:scrapy $ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
(scrapyenv) macbook-pro:scrapy $ scrapy genspider example example.com
Created spider 'example' using template 'basic'
(scrapyenv) macbook-pro:scrapy $ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'
(scrapyenv) macbook-pro:scrapy $
This command provides a convenient way to create spiders, although we can of course write spider source files by hand.
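For reference, the spider generated above from the basic template looks roughly like this (the exact contents may vary slightly between Scrapy versions):

```python
# example.py -- generated by `scrapy genspider example example.com`
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass
```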
4. Scrapy crawl
Syntax: scrapy crawl <spider>
Requires project: yes
Starts crawling with the specified spider.
(scrapyenv) macbook-pro:project $ scrapy crawl myspider
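As a sketch of what this does, roughly the same crawl can be started from Python with CrawlerProcess (run it from the project root so the project settings resolve; myspider is an assumed spider name):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Roughly the Python equivalent of `scrapy crawl myspider`.
process = CrawlerProcess(get_project_settings())
process.crawl('myspider')  # spider name, looked up in the project
process.start()            # blocks until the crawl finishes
```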
5. Scrapy check
Syntax: scrapy check [-l] <spider>
Requires project: yes
Runs contract checks on the project's spiders (with -l, only lists the contracts without checking them).
(scrapyenv) macbook-pro:project $ scrapy check -l
(scrapyenv) macbook-pro:project $ scrapy check
----------------------------------------------------------------------
Ran 0 contracts in 0.000s

OK
(scrapyenv) macbook-pro:project $
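Contracts are declared in the docstrings of spider callbacks. A minimal sketch, assuming a spider for quotes.toscrape.com like the ones listed in the next section:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'toscrape-css'

    def parse(self, response):
        """Contracts checked by `scrapy check` live in this docstring.

        @url http://quotes.toscrape.com/
        @returns items 1
        @scrapes text author
        """
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
```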
6. Scrapy list
Syntax: scrapy list
Requires project: yes
Lists all available spiders in the current project.
(scrapyenv) macbook-pro:project $ scrapy list
toscrape-css
toscrape-xpath
(scrapyenv) macbook-pro:project $
7. Scrapy edit
Syntax: scrapy edit <spider>
Requires project: yes
Opens the given spider for editing in the editor specified by the EDITOR environment variable (or, if unset, the EDITOR setting).
(scrapyenv) macbook-pro:project $ scrapy edit toscrape-css
(scrapyenv) macbook-pro:project $
8. Scrapy fetch
Syntax: scrapy fetch <url>
Requires project: no
Downloads the given URL with the Scrapy downloader and writes the content to standard output.
Notably, it fetches the page the way the spider would download it: if the spider has a USER_AGENT attribute, fetch uses it as its own user agent. When fetch is used outside a Scrapy project, no spider-specific settings apply and the default Scrapy downloader settings are used.
This command supports three options:
--spider=SPIDER: bypass spider auto-detection and force the specified spider
--headers: print the response's HTTP headers instead of its body
--no-redirect: do not follow HTTP 3xx redirects (by default, redirects are followed)
(scrapyenv) macbook-pro:project $ scrapy fetch --nolog http://www.example.com/some/page.html
[... html content here ...]
(scrapyenv) macbook-pro:project $ scrapy fetch --nolog --headers http://www.example.com/
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en
> User-Agent: Scrapy/1.4.0 (+http://scrapy.org)
> Accept-Encoding: gzip,deflate
>
< Cache-Control: max-age=604800
< Content-Type: text/html
< Date: Wed, Oct 2017 13:55:57 GMT
< Etag: "359670651+gzip"
< Expires: Wed, Nov 2017 13:55:57 GMT
< Last-Modified: Fri, Aug 2013 23:54:35 GMT
< Server: ECS (oxr/839f)
< Vary: Accept-Encoding
< X-Cache: HIT
(scrapyenv) macbook-pro:project $
9. Scrapy view
Syntax: scrapy view <url>
Requires project: no
Opens the given URL in a browser. Sometimes a spider sees a page differently from a regular user, so this command can be used to check whether the spider sees the page the way we expect.
Supported options:
--spider=SPIDER: force the specified spider
--no-redirect: do not follow redirects (by default, redirects are followed)
(scrapyenv) macbook-pro:project $ scrapy view http://www.163.com
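A similar check can be done from inside a spider with the open_in_browser helper; a sketch (the excerpt assumes it sits in a spider class):

```python
from scrapy.utils.response import open_in_browser

def parse(self, response):
    # Dump the response, exactly as Scrapy received it, into a local browser.
    open_in_browser(response)
```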
10. Scrapy shell
Syntax: scrapy shell [url]
Requires project: no
Launches the Scrapy shell for the given URL, or simply starts the shell if no URL is given. UNIX-style local file paths are also supported, both relative (starting with ./ or ../) and absolute.
Supported options:
--spider=SPIDER: force the specified spider
-c CODE: evaluate the code in the shell, print the result, and exit
--no-redirect: do not follow HTTP 3xx redirects (by default, redirects are followed); this applies only to the URL passed as an argument on the command line. Once inside the shell, fetch(url) follows redirects by default.
(scrapyenv) macbook-pro:project $ scrapy shell http://www.example.com/some/page.html
[... scrapy shell starts ...]
(scrapyenv) macbook-pro:project $ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
(200, 'http://www.example.com/')
(scrapyenv) macbook-pro:project $ scrapy shell --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(200, 'http://example.com/')
(scrapyenv) macbook-pro:project $ scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(302, 'http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F')
(scrapyenv) macbook-pro:project $
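Inside the shell, the built-in shortcuts fetch, response and view are available. An illustrative session (the return values assume example.com's current content):

```python
# Inside `scrapy shell`, started without a URL:
fetch('http://www.example.com/')             # follows redirects by default
response.status                              # -> 200
response.css('title::text').extract_first()  # -> 'Example Domain'
view(response)                               # open the response in a browser
```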
11. Scrapy parse
Syntax: scrapy parse <url> [options]
Requires project: yes
Fetches the page at the given URL and parses it with the spider that handles that URL, using the method passed with --callback, or the default parse method if none is specified.
Supported options:
--spider=SPIDER: force the specified spider
--a NAME=VALUE: set a spider argument (may be repeated)
--callback or -c: spider callback method to use for parsing the response
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback for parsing the response
--noitems: do not show scraped items
--nolinks: do not show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level for which requests should be followed recursively (default: 1)
--verbose or -v: display information for each depth level
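As an illustration (the spider name comes from the earlier listing; the URL and callback choice are assumptions), a typical invocation looks like this, output omitted:

(scrapyenv) macbook-pro:project $ scrapy parse --spider=toscrape-css -c parse http://quotes.toscrape.com/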
12. Scrapy settings
Syntax: scrapy settings [options]
Requires project: no
Gets the value of a Scrapy setting.
If used inside a project, it shows the project's setting value; otherwise it shows the default Scrapy value.
(scrapyenv) macbook-pro:project $ scrapy settings --get BOT_NAME
quotesbot
(scrapyenv) macbook-pro:project $ scrapy settings --get DOWNLOAD_DELAY
0
(scrapyenv) macbook-pro:project $
13. Scrapy runspider
Syntax: scrapy runspider <spider_file.py>
Requires project: no
Runs a spider contained in a single Python file, without having to create a project.
$ scrapy runspider myspider.py
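A minimal self-contained spider that could be run this way (the file name and selector are illustrative):

```python
# myspider.py -- run with: scrapy runspider myspider.py
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # Yield a single item containing the page title.
        yield {'title': response.css('title::text').extract_first()}
```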
14. Scrapy version
Syntax: scrapy version [-v]
Requires project: no
Prints the Scrapy version. When used with -v, it also prints Python, Twisted and platform information.
15. Scrapy bench
Syntax: scrapy bench
Requires project: no
Runs a quick benchmark test.