Scrapy is controlled through the scrapy command-line tool, which provides a number of commands for different purposes, each with its own arguments and options.
Some scrapy commands must be executed inside a Scrapy project directory, while others can be run anywhere. Commands that can run anywhere may behave slightly differently when executed inside a Scrapy project directory.
Scrapy Command Execution Environment

| Global commands | Project-only commands |
| --------------- | --------------------- |
| startproject    | crawl                 |
| genspider       | check                 |
| settings        | list                  |
| runspider       | edit                  |
| shell           | parse                 |
| fetch           | bench                 |
| view            |                       |
| version         |                       |
1. Scrapy
First, run the scrapy command-line tool without any command; it prints usage information and the available commands to the screen:
(scrapyenv) macbook-pro:~ $ scrapy
Scrapy 1.4.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
(scrapyenv) macbook-pro:~ $
If the command is run inside a Scrapy project directory, the first line shows the name of the current project; outside of any project it shows "no active project".
To learn more about a specific command, use scrapy <command> -h:
(scrapyenv) macbook-pro:myproject $ scrapy startproject -h
Usage
=====
  scrapy startproject <project_name> [project_dir]

Create new project

Options
=======
--help, -h              show this help message and exit

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure
(scrapyenv) macbook-pro:myproject $
2. Scrapy startproject
Syntax: scrapy startproject <project_name> [project_dir]
Requires project: no
Create a Scrapy project.
(scrapyenv) macbook-pro:project $ scrapy startproject myproject [project_dir]
A new Scrapy project is created under the project_dir directory. If project_dir is not specified, it defaults to myproject.
Then change into the new project's directory; from there, the scrapy command can be used to manage and control the new project.
Two points are worth expanding on here:
2.1 Configuration Settings
The Scrapy configuration is stored in scrapy.cfg files, which can live in three places:
1. /etc/scrapy.cfg or c:\scrapy\scrapy.cfg (system-level configuration)
2. ~/.config/scrapy.cfg ($XDG_CONFIG_HOME) and ~/.scrapy.cfg ($HOME) (user-level configuration)
3. scrapy.cfg in the root directory of the Scrapy project (project-level configuration)
These configurations are merged together with precedence 3 > 2 > 1: project-level settings override user-level settings, which in turn override system-level settings.
Scrapy can also be configured through several environment variables:
SCRAPY_SETTINGS_MODULE
SCRAPY_PROJECT
SCRAPY_PYTHON_SHELL
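For instance, SCRAPY_SETTINGS_MODULE tells Scrapy which settings module to load. A minimal sketch of using it from plain Python (the module name myproject.settings is an assumption, and it must be importable, e.g. run the script from the project root):

```python
import os

# Assumption: myproject.settings is importable from the current directory.
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.settings'

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print(settings.get('BOT_NAME'))  # prints the project's bot name
```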
2.2 Project Structure
The default structure of a Scrapy project looks like this:
.
|____myproject
| |____items.py
| |____middlewares.py
| |____pipelines.py
| |____settings.py
| |____spiders
| | |____spider1.py
| | |____spider2.py
|____scrapy.cfg
The directory containing scrapy.cfg is the root directory of the project. The file holds the name of the Python module that defines the project settings, for example:
[settings]
default = myproject.settings
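The modules in the project package are where the actual scraping logic lives. As an illustration (the field names are assumptions, not part of the generated template), myproject/items.py might define an item like this:

```python
# myproject/items.py -- an illustrative Item definition
import scrapy

class QuoteItem(scrapy.Item):
    # Each Field() declares an attribute that spiders can populate.
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
```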
3. Scrapy genspider
Syntax: scrapy genspider [-t template] <name> <domain>
Requires project: no
Creates a new spider in the current directory, or in the spiders directory of the current project. The <name> argument sets the spider's name, and <domain> is used to generate the spider's allowed_domains and start_urls attributes.
(scrapyenv) macbook-pro:scrapy $ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
(scrapyenv) macbook-pro:scrapy $ scrapy genspider example example.com
Created spider 'example' using template 'basic'
(scrapyenv) macbook-pro:scrapy $ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'
(scrapyenv) macbook-pro:scrapy $
This command provides a convenient way to create spiders, although we can of course write spider source files by hand.
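For reference, the spider generated above from the basic template looks roughly like this (the exact contents may vary slightly between Scrapy versions):

```python
# example.py -- generated by `scrapy genspider example example.com`
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass
```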
4. Scrapy crawl
Syntax: scrapy crawl <spider>
Requires project: yes
Starts crawling with the specified spider.
(scrapyenv) macbook-pro:project $ scrapy crawl myspider
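As a sketch of what this does, roughly the same crawl can be started from Python with CrawlerProcess (run it from the project root so the project settings resolve; myspider is an assumed spider name):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Roughly the Python equivalent of `scrapy crawl myspider`.
process = CrawlerProcess(get_project_settings())
process.crawl('myspider')  # spider name, looked up in the project
process.start()            # blocks until the crawl finishes
```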
5. Scrapy check
Syntax: scrapy check [-l] <spider>
Requires project: yes
Runs contract checks on the project's spiders (with -l, only lists the contracts without checking them).
(scrapyenv) macbook-pro:project $ scrapy check -l
(scrapyenv) macbook-pro:project $ scrapy check
----------------------------------------------------------------------
Ran 0 contracts in 0.000s

OK
(scrapyenv) macbook-pro:project $
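Contracts are declared in the docstrings of spider callbacks. A minimal sketch, assuming a spider for quotes.toscrape.com like the ones listed in the next section:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'toscrape-css'

    def parse(self, response):
        """Contracts checked by `scrapy check` live in this docstring.

        @url http://quotes.toscrape.com/
        @returns items 1
        @scrapes text author
        """
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
```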
6. Scrapy list
Syntax: scrapy list
Requires project: yes
Lists all available spiders in the current project.
(scrapyenv) macbook-pro:project $ scrapy list
toscrape-css
toscrape-xpath
(scrapyenv) macbook-pro:project $
7. Scrapy edit
Syntax: scrapy edit <spider>
Requires project: yes
Opens the given spider for editing in the editor specified by the EDITOR environment variable (or, if unset, the EDITOR setting).
(scrapyenv) macbook-pro:project $ scrapy edit toscrape-css
(scrapyenv) macbook-pro:project $
8. Scrapy fetch
Syntax: scrapy fetch <url>
Requires project: no
Downloads the given URL with the Scrapy downloader and writes the content to standard output.
Notably, it fetches the page the way the spider would download it: if the spider has a USER_AGENT attribute, fetch uses it as its own user agent. When fetch is used outside a Scrapy project, no spider-specific settings apply and the default Scrapy downloader settings are used.
This command supports three options:
--spider=SPIDER: bypass spider auto-detection and force the specified spider
--headers: print the response's HTTP headers instead of its body
--no-redirect: do not follow HTTP 3xx redirects (by default, redirects are followed)
(scrapyenv) macbook-pro:project $ scrapy fetch --nolog http://www.example.com/some/page.html
[... html content here ...]
(scrapyenv) macbook-pro:project $ scrapy fetch --nolog --headers http://www.example.com/
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en
> User-Agent: Scrapy/1.4.0 (+http://scrapy.org)
> Accept-Encoding: gzip,deflate
>
< Cache-Control: max-age=604800
< Content-Type: text/html
< Date: Wed, Oct 2017 13:55:57 GMT
< Etag: "359670651+gzip"
< Expires: Wed, Nov 2017 13:55:57 GMT
< Last-Modified: Fri, Aug 2013 23:54:35 GMT
< Server: ECS (oxr/839f)
< Vary: Accept-Encoding
< X-Cache: HIT
(scrapyenv) macbook-pro:project $
9. Scrapy view
Syntax: scrapy view <url>
Requires project: no
Opens the given URL in a browser. Sometimes a spider sees a page differently from a regular user, so this command can be used to check whether the spider sees the page the way we expect.
Supported options:
--spider=SPIDER: force the specified spider
--no-redirect: do not follow redirects (by default, redirects are followed)
(scrapyenv) macbook-pro:project $ scrapy view http://www.163.com
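A similar check can be done from inside a spider with the open_in_browser helper; a sketch (the excerpt assumes it sits in a spider class):

```python
from scrapy.utils.response import open_in_browser

def parse(self, response):
    # Dump the response, exactly as Scrapy received it, into a local browser.
    open_in_browser(response)
```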
10. Scrapy shell
Syntax: scrapy shell [url]
Requires project: no
Launches the Scrapy shell for the given URL, or simply starts the shell if no URL is given. UNIX-style local file paths are also supported, both relative (starting with ./ or ../) and absolute.
Supported options:
--spider=SPIDER: force the specified spider
-c CODE: evaluate the code in the shell, print the result, and exit
--no-redirect: do not follow HTTP 3xx redirects (by default, redirects are followed); this applies only to the URL passed as an argument on the command line. Once inside the shell, fetch(url) follows redirects by default.
(scrapyenv) macbook-pro:project $ scrapy shell http://www.example.com/some/page.html
[... scrapy shell starts ...]
(scrapyenv) macbook-pro:project $ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
(200, 'http://www.example.com/')
(scrapyenv) macbook-pro:project $ scrapy shell --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(200, 'http://example.com/')
(scrapyenv) macbook-pro:project $ scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(302, 'http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F')
(scrapyenv) macbook-pro:project $
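Inside the shell, the built-in shortcuts fetch, response and view are available. An illustrative session (the return values assume example.com's current content):

```python
# Inside `scrapy shell`, started without a URL:
fetch('http://www.example.com/')             # follows redirects by default
response.status                              # -> 200
response.css('title::text').extract_first()  # -> 'Example Domain'
view(response)                               # open the response in a browser
```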
11. Scrapy parse
Syntax: scrapy parse <url> [options]
Requires project: yes
Fetches the page at the given URL and parses it with the spider that handles that URL, using the method passed with --callback, or the default parse method if none is specified.
Supported options:
--spider=SPIDER: force the specified spider
--a NAME=VALUE: set a spider argument (may be repeated)
--callback or -c: spider callback method to use for parsing the response
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback for parsing the response
--noitems: do not show scraped items
--nolinks: do not show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level for which requests should be followed recursively (default: 1)
--verbose or -v: display information for each depth level
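As an illustration (the spider name comes from the earlier listing; the URL and callback choice are assumptions), a typical invocation looks like this, output omitted:

(scrapyenv) macbook-pro:project $ scrapy parse --spider=toscrape-css -c parse http://quotes.toscrape.com/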
12. Scrapy settings
Syntax: scrapy settings [options]
Requires project: no
Gets the value of a Scrapy setting.
If used inside a project, it shows the project's setting value; otherwise it shows the default Scrapy value.
(scrapyenv) macbook-pro:project $ scrapy settings --get BOT_NAME
quotesbot
(scrapyenv) macbook-pro:project $ scrapy settings --get DOWNLOAD_DELAY
0
(scrapyenv) macbook-pro:project $
13. Scrapy runspider
Syntax: scrapy runspider <spider_file.py>
Requires project: no
Runs a spider contained in a single Python file, without having to create a project.
$ scrapy runspider myspider.py
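A minimal self-contained spider that could be run this way (the file name and selector are illustrative):

```python
# myspider.py -- run with: scrapy runspider myspider.py
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # Yield a single item containing the page title.
        yield {'title': response.css('title::text').extract_first()}
```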
14. Scrapy version
Syntax: scrapy version [-v]
Requires project: no
Prints the Scrapy version. When used with -v, it also prints Python, Twisted and platform information.
15. Scrapy bench
Syntax: scrapy bench
Requires project: no
Runs a quick benchmark test.