These are notes that combine the official Scrapy documentation with some of my own learning and practice.
Scrapy is controlled through the scrapy command-line tool, referred to here as the "Scrapy tool" to distinguish it from its subcommands. The subcommands are simply called "commands" or "Scrapy commands".
The Scrapy tool provides multiple commands for different purposes, and each command supports different parameters and options.
Default Scrapy Project Structure
Before beginning the exploration of command-line tools and subcommands, let's first look at the directory structure of the Scrapy project.
Although it can be modified, all scrapy projects default to a file structure similar to the following:
scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
The directory containing the scrapy.cfg file is considered the root directory of the project. That file contains a field naming the Python module that defines the project's settings. For example:
default = myproject.settings
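Put together, a minimal scrapy.cfg looks roughly like this (a sketch following the example above; further sections, such as a [deploy] section for Scrapyd, are optional and can be added later):

[settings]
default = myproject.settings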
Using the scrapy tool
You can run the scrapy tool with no arguments. It prints some usage help and the available commands:
Scrapy X.Y - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  crawl         Run a spider
  fetch         Fetch a URL using the Scrapy downloader
[...]
If you are inside a Scrapy project, the first line of the output shows the currently active project. The output above was produced outside a project. When you run the command inside a project, you get something like this:
Scrapy X.Y - project: myproject

Usage:
  scrapy <command> [options] [args]

[...]
Create a project
In general, the first thing you do with the scrapy tool is create your Scrapy project:
scrapy startproject scrapytest
The command will create a Scrapy project in the scrapytest directory.
Next, go to the project directory:
cd scrapytest
You can then use the scrapy command to manage and control your project.
Note that the terminal built into PyCharm also supports these commands on Windows.
The created project has the structure shown above; no spider has been created at this point.
Controlling a project
Inside your project you can use the scrapy tool to control and manage it.
For example, create a new spider from the command line or terminal under the project:
scrapy genspider spiderdemo domaintest
After creation, a new spiderdemo.py file appears under the project's spiders/ directory.
The automatically generated code looks roughly like the sketch below.
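This is a sketch of the basic-template spider that genspider generates, with the name and domain taken from the command above; the exact layout varies slightly between Scrapy versions:

# -*- coding: utf-8 -*-
import scrapy


class SpiderdemoSpider(scrapy.Spider):
    name = 'spiderdemo'
    allowed_domains = ['domaintest']
    start_urls = ['http://domaintest/']

    def parse(self, response):
        # parse the downloaded response here
        pass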
Some Scrapy commands (such as crawl) must be run inside a Scrapy project. See the command reference below for which commands require a project and which do not.
Also note that some commands behave slightly differently when run inside a project. Take the fetch command: if the URL being fetched is associated with a particular spider, the command uses that spider's overridden behaviour (for example, a spider-specific user_agent attribute). This is intentional, because the fetch command is meant to test how the spider downloads a page.
Available tool commands
This section provides a list of the available built-in commands, each with a description and some usage examples. You can always get detailed help about a command by running:
scrapy <command> -h
You can also view the available commands this way:
scrapy -h
Scrapy provides two kinds of commands: commands that must be run inside a Scrapy project (project-specific commands), and commands that also work without a project (global commands). Global commands may behave differently when run inside a project than outside one, because the project's settings are used.
Global commands:
startproject
settings
runspider
shell
fetch
view
version
Project-only commands:
crawl
check
list
edit
parse
genspider
deploy
bench
startproject
- Syntax:
scrapy startproject <project_name>
- Requires project: no
Creates a new Scrapy project named project_name in a project_name directory.
Example:
scrapy startproject myproject
genspider
- Syntax:
scrapy genspider [-t template] <name> <domain>
- Requires project: yes
Creates a new spider in the current project.
This is just a convenient shortcut for creating spiders from predefined templates; you can also write the spider's source file yourself.
Example:
# Use the -l parameter to list the available spider templates
scrapy genspider -l
# The output lists the template names
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
Viewing a spider template's source code:
# Use the -d parameter plus a template name to print the template's source
scrapy genspider -d crawl
# Output:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class $classname(CrawlSpider):
    name = '$name'
    allowed_domains = ['$domain']
    start_urls = ['http://$domain/']

    rules = (
        Rule(LinkExtractor(allow=r'items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
# Create a spider from a template: scrapy genspider -t <template> <spidername> [domain]
$ scrapy genspider -t basic example example.com
Created spider 'example' using template 'basic' in module:
  mybot.spiders.example
crawl
- Syntax:
scrapy crawl <spider>
- Requires project: yes
Starts crawling using the given spider.
Example:
$ scrapy crawl myspider
[... myspider starts crawling ...]
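crawl also accepts -a NAME=VALUE options, which are passed to the spider as constructor arguments. A minimal sketch (the category argument and URL pattern are hypothetical):

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # build the start URL from the argument passed on the command line
        self.start_urls = ['http://www.example.com/categories/%s' % category]

    def parse(self, response):
        pass

You would then run it with: scrapy crawl myspider -a category=books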
check
- Syntax:
scrapy check [-l] <spider>
- Requires project: yes
Runs contract checks.
Example:
$ scrapy check -l
first_spider
  * parse
  * parse_item
second_spider
  * parse
  * parse_item

$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing

[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4
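Contracts are written in the docstrings of spider callbacks; scrapy check downloads the @url and verifies the annotations. A minimal sketch (spider name, URL, XPath and field are hypothetical):

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'

    def parse_item(self, response):
        """ This callback's contracts are verified by `scrapy check demo`.

        @url http://www.example.com/some/page.html
        @returns items 1 1
        @scrapes name
        """
        # extract a single item carrying the contracted 'name' field
        titles = response.xpath('//h1/text()').extract()
        yield {'name': titles[0] if titles else ''}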
list
- Syntax:
scrapy list
- Requires project: yes
Lists all the available spiders in the current project, one spider per line.
Example:
$ scrapy list
spider1
spider2
edit
- Syntax:
scrapy edit <spider>
- Requires project: yes
Edits the given spider using the editor set in the EDITOR environment variable.
The command is provided only as a convenient shortcut; developers are of course free to use any other tool or IDE to write and debug their spiders.
Example:
$ scrapy edit spider1
fetch
- Syntax:
scrapy fetch <url>
- Requires project: no
Downloads the given URL using the Scrapy downloader and writes the fetched content to standard output.
The command fetches the page the same way the spider would download it. For example, if the spider has a USER_AGENT attribute that overrides the User-Agent header, the command will use it.
So you can use this command to see how your spider would fetch a particular page.
When run outside a project, the command uses the default Scrapy downloader settings.
Example:
$ scrapy fetch --nolog http://www.example.com/some/page.html
[... html content here ...]

$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
 'Age': ['1263'],
 'Connection': ['close'],
 'Content-Length': ['596'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
 'Etag': ['"573c1-254-48c9c87349680"'],
 'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
 'Server': ['Apache/2.2.3 (CentOS)']}
view
- Syntax:
scrapy view <url>
- Requires project: no
Opens the given URL in a browser, displayed as your Scrapy spider would "see" it. Spiders sometimes see pages differently from regular users, so this command can be used to check what the spider gets and confirm it is what you expect.
Example:
$ scrapy view http://www.example.com/some/page.html
[... browser starts ...]
shell
- Syntax:
scrapy shell [url]
- Requires project: no
Starts the Scrapy shell for the given URL (if given), or empty if no URL is given. See the Scrapy shell documentation for more information.
Example:
$ scrapy shell http://www.example.com/some/page.html
[... scrapy shell starts ...]
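Once the shell starts, the downloaded response is available for interactive inspection. A rough sketch of a session (the XPath result shown is hypothetical, and response.xpath assumes a Scrapy version that exposes it):

>>> response.status
200
>>> response.xpath('//title/text()').extract()
[u'Some page title']
>>> fetch('http://www.example.com/another/page.html')  # download another page in the same shell
[... scrapy log lines ...]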
parse
- Syntax:
scrapy parse <url> [options]
- Requires project: yes
Fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback option, or parse if no callback is given.
Supported options:
--spider=SPIDER: bypass spider autodetection and force use of a specific spider
-a NAME=VALUE: set a spider argument (may be repeated)
--callback or -c: the spider method to use as the callback for parsing the response
--pipelines: process items through the item pipelines
--rules or -r: use CrawlSpider rules to discover the callback used to parse the response
--noitems: don't show the scraped items
--nolinks: don't show the extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: the depth level to which follow-up link requests are made (default: 1)
--verbose or -v: display details for each request
Example:
$ scrapy parse http://www.example.com/ -c parse_item
[... scrapy log lines crawling example.com spider ...]

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'name': u'Example item',
  'category': u'Furniture',
  'length': u'12 cm'}]

# Requests  -----------------------------------------------------------------
[]
settings
- Syntax:
scrapy settings [options]
- Requires project: no
Gets the value of a Scrapy setting.
When run inside a project, the command shows the project's value for the setting; otherwise it shows Scrapy's default value.
Example:
$ scrapy settings --get BOT_NAME
scrapybot
$ scrapy settings --get DOWNLOAD_DELAY
0
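Scrapy commands also accept a global -s NAME=VALUE (--set) option that overrides a setting for that single invocation. A quick sketch combining it with settings --get (the value is just illustrative):

$ scrapy settings --get DOWNLOAD_DELAY -s DOWNLOAD_DELAY=2
2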
runspider
- Syntax:
scrapy runspider <spider_file.py>
- Requires project: no
Runs a spider contained in a single Python file, without having to create a project.
Example:
$ scrapy runspider myspider.py
[... spider starts crawling ...]
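A minimal sketch of such a standalone spider file (the URL and the scraped field are hypothetical):

# myspider.py -- run with: scrapy runspider myspider.py
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # yield one item per <title> found on the page
        for title in response.xpath('//title/text()').extract():
            yield {'title': title}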
version
- Syntax:
scrapy version [-v]
- Requires project: no
Prints the Scrapy version. When run with -v it also prints Python, Twisted and platform information, which is useful for bug reports.
deploy
New in version 0.11.
- Syntax:
scrapy deploy [ <target:project> | -l <target> | -L ]
- Requires project: yes
Deploys the project to a Scrapyd server. See the documentation on deploying your project.
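Deployment targets are defined in scrapy.cfg. A sketch, assuming a Scrapyd instance running locally on its default port (the target and project names are just examples):

# in scrapy.cfg
[deploy:local]
url = http://localhost:6800/
project = myproject

# deploy the project to that target
$ scrapy deploy local:myproject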
bench
New in version 0.17.
- Syntax:
scrapy bench
- Requires project: no
Runs a quick benchmark test. See the benchmarking documentation.
Custom project commands
You can also add your own project commands via the COMMANDS_MODULE setting. See the built-in Scrapy commands in scrapy/commands for examples of how commands are implemented.
COMMANDS_MODULE
Default: '' (empty string)
The module used to look up custom Scrapy commands.
Example:
COMMANDS_MODULE = 'mybot.commands'
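A minimal sketch of a custom command: with the setting above, a ScrapyCommand subclass placed in mybot/commands/hello.py would become available as "scrapy hello" (module layout and names here are hypothetical; see scrapy/commands for real examples):

# mybot/commands/hello.py
from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):
    requires_project = True

    def short_desc(self):
        return "Print a greeting (example custom command)"

    def run(self, args, opts):
        print("Hello from a custom Scrapy command")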