Python -- Scrapy command-line tool


This article combines the official Scrapy documentation with notes organized from my own learning and practice.

Scrapy is controlled through the scrapy command-line tool, referred to here as the "Scrapy tool" to distinguish it from its subcommands, which we simply call "commands" or "Scrapy commands".

The Scrapy tool provides multiple commands for different purposes, and each command supports different parameters and options.

Default Scrapy Project Structure

Before beginning the exploration of command-line tools and subcommands, let's first look at the directory structure of the Scrapy project.

Although it can be modified, all Scrapy projects have the same default file structure, similar to the following:

scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...

The directory containing the scrapy.cfg file is considered the root directory of the project. That file contains the name of the Python module that defines the project's settings. For example:

[settings]
default = myproject.settings
Using the scrapy tool

You can run the Scrapy tool with no arguments; it will print some usage help and the available commands:

Scrapy X.Y - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  crawl         Run a spider
  fetch         Fetch a URL using the Scrapy downloader
[...]

If you are inside a Scrapy project, the currently active project is shown in the first line of the output. The output above was produced outside a project; if you run a command from inside a project, you will see something like:

Scrapy X.Y - project: myproject

Usage:
  scrapy <command> [options] [args]
[...]

Create a project

Generally, the first thing you do with the scrapy tool is create your Scrapy project:

scrapy startproject scrapytest

The command will create a scrapy project in the scrapytest directory.

Next, go to the project directory:

cd scrapytest

You can then use the scrapy command to manage and control your project.

Note that the terminal built into PyCharm also supports these commands on Windows.

The created project structure matches the default layout shown at the beginning of this article; no spider has been created at this point.

Controlling the project

You can use the scrapy tool from inside your project to control and manage it.

For example, create a new spider from the command line or terminal inside the project:

scrapy genspider spiderdemo domaintest

After creation, the directory structure gains a new spiderdemo.py file under the spiders/ directory.

The automatically generated spider code is shown below.
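As a rough sketch (assuming the default basic template; the exact header comments vary between Scrapy versions), the generated spiderdemo spider looks like this:

# -*- coding: utf-8 -*-
import scrapy


class SpiderdemoSpider(scrapy.Spider):
    # name used when running "scrapy crawl spiderdemo"
    name = 'spiderdemo'
    # requests to other domains are filtered out by the offsite middleware
    allowed_domains = ['domaintest']
    start_urls = ['http://domaintest/']

    def parse(self, response):
        # default callback, called with the response of each start URL
        pass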

Some Scrapy commands (such as crawl) must be run inside a Scrapy project. See the command reference below to find out which commands require a project and which do not.

Also note that some commands behave slightly differently when run inside a project. Take the fetch command, for example: if the fetched URL is associated with a particular spider, the command will use spider-overridden behaviours (such as the user_agent attribute specified by the spider). This is intentional, since the fetch command is generally meant to check how the spider downloads a page.
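For illustration only (the spider name, domain and user agent string below are made up), a spider that overrides the user agent could look like this; running scrapy fetch inside the project on a URL matching this spider would then use that user agent:

import scrapy


class UaSpider(scrapy.Spider):
    name = 'uaspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']
    # spider-level override; "scrapy fetch" picks this up when the URL
    # is associated with this spider and the command runs inside the project
    user_agent = 'MyCustomBot/1.0'

    def parse(self, response):
        self.logger.info('Fetched %s with a custom user agent', response.url)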

Available tool commands

This section lists the available built-in commands, each with a description and some usage examples. You can always get detailed help for a command by running:

scrapy <command> -h

You can also view the available commands in this way:

scrapy -h

Scrapy provides two types of commands: project-specific commands, which must be run inside a Scrapy project, and global commands, which do not require one. Global commands may behave differently when run inside a project than outside it, because the project's settings may be used.

Global commands:

    • startproject
    • settings
    • runspider
    • shell
    • fetch
    • view
    • version

Project-only commands:

    • crawl
    • check
    • list
    • edit
    • parse
    • genspider
    • deploy
    • bench

Startproject
    • Syntax: scrapy startproject <project_name>
    • Requires project: no

Creates a new Scrapy project named project_name, under a project_name directory.

Example:

scrapy startproject myproject

Genspider

    • Syntax: scrapy genspider [-t template] <name> <domain>
    • Requires project: yes

Creates a spider in the current project.

This is just a convenient shortcut for creating a spider from a pre-defined template; you can also write the spider's source file yourself.

Example:

# Use the -l option to list the available spider templates:
scrapy genspider -l
# The output is the list of template names:
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

To view a spider template's source code:

# Use the -d option followed by the template name to dump the template:
scrapy genspider -d crawl

# Output (the crawl template):
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class $classname(CrawlSpider):
    name = '$name'
    allowed_domains = ['$domain']
    start_urls = ['http://$domain/']

    rules = (
        Rule(LinkExtractor(allow=r'items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

# Create a spider with -t followed by a template name, a spider name and a domain:
scrapy genspider -t basic example example.com
Created spider 'example' using template 'basic' in module:
  mybot.spiders.example
Crawl
    • Syntax: scrapy crawl <spider>
    • Requires project: yes

Starts crawling using a spider.

Example:

$ scrapy crawl myspider
[ ... myspider starts crawling ... ]
Check
    • Syntax: scrapy check [-l] <spider>
    • Requires project: yes

Runs contract checks on the project's spiders.

Example:

$ scrapy check -l
first_spider
  * parse
  * parse_item
second_spider
  * parse
  * parse_item

$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing

[FAILED] first_spider:parse
>>> Returned requests, expected 0..4
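For reference, contracts live in the docstrings of spider callbacks. The following is only a sketch (the URL and the RetailPricex field are illustrative) of what a checked callback might look like:

import scrapy


class FirstSpider(scrapy.Spider):
    name = 'first_spider'

    def parse_item(self, response):
        """Contracts checked by "scrapy check" are declared here.

        @url http://www.example.com/some/page.html
        @returns items 1 16
        @returns requests 0 0
        @scrapes RetailPricex
        """
        # "scrapy check" downloads @url, calls this callback and verifies
        # that the returned items/requests satisfy the contracts above
        yield {'RetailPricex': response.css('.price::text').get()}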
List
    • Syntax: scrapy list
    • Requires project: yes

Lists all available spiders in the current project, one spider per line.

Examples of Use:

$ scrapy list
spider1
spider2
Edit
    • Syntax: scrapy edit <spider>
    • Requires project: yes

Edits the given spider using the editor set in the EDITOR setting.

This command is provided only as a convenient shortcut; developers are free to use any other tool or IDE to write and debug spiders.

Example:

$ scrapy edit spider1
Fetch
    • Syntax: scrapy fetch <url>
    • Requires project: no

Downloads the given URL using the Scrapy downloader and writes the content to standard output.

This command fetches the page the same way the spider would download it. For example, if the spider has a user_agent attribute overriding the User Agent, the command will use that attribute.

Therefore, you can use this command to see how the spider obtains a particular page.

This command will use the default Scrapy downloader setting if it is running in a non-project.

Example:

$ scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]

$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
 'Age': ['1263   '],
 'Connection': ['close     '],
 'Content-Length': ['596'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
 'Etag': ['"573c1-254-48c9c87349680"'],
 'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
 'Server': ['Apache/2.2.3 (CentOS)']}
View
    • Syntax: scrapy view <url>
    • Requires project: no

Opens the given URL in a browser, displaying it as your Scrapy spider would "see" it. Spiders sometimes see pages differently from regular users, so this command can be used to check what the spider gets and confirm it is what you expect.

Example:

$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]
Shell
    • Syntax: scrapy shell [url]
    • Requires project: no

Starts the Scrapy shell for the given URL (if given), or empty if no URL is given. See the Scrapy shell documentation for more information.

Example:

$ scrapy shell http://www.example.com/some/page.html
[ ... scrapy shell starts ... ]
Parse
    • Syntax: scrapy parse <url> [options]
    • Requires project: yes

Fetches the given URL and parses it with the spider that handles it. If the --callback option is given, that spider method is used to process the response; otherwise parse is used.

Supported options:

  • --spider=SPIDER: skip automatic spider detection and force the use of a specific spider
  • -a NAME=VALUE: set a spider argument (may be repeated)
  • --callback or -c: the spider method to use as the callback for parsing the response
  • --pipelines: process items through the item pipelines
  • --rules or -r: use CrawlSpider rules to discover the callback used to parse the response
  • --noitems: do not show scraped items
  • --nolinks: do not show extracted links
  • --nocolour: avoid using pygments to colorize the output
  • --depth or -d: depth level for which requests should be followed recursively (default: 1)
  • --verbose or -v: display information for each depth level
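To make these options concrete, here is a hypothetical spider (name, URL and fields are made up) whose parse_item method could be selected with -c parse_item, and which accepts an argument passed with -a category=furniture:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'          # usable with --spider=example
    start_urls = ['http://www.example.com/']

    def __init__(self, category=None, *args, **kwargs):
        # spider argument, settable from the command line with -a category=...
        super().__init__(*args, **kwargs)
        self.category = category

    def parse_item(self, response):
        # callback selected with --callback/-c parse_item
        yield {
            'name': response.css('h1::text').get(),
            'category': self.category,
        }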

Example:

$ scrapy parse http://www.example.com/ -c parse_item
[ ... scrapy log lines crawling example.com spider ... ]

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'name': u'Example item',
  'category': u'Furniture',
  'length': u'12 cm'}]

# Requests  -----------------------------------------------------------------
[]
Settings
    • Syntax: scrapy settings [options]
    • Requires project: no

Gets the value of a Scrapy setting.

When run inside a project, the command outputs the project's setting value; otherwise it outputs the default Scrapy value for that setting.

Example:

$ scrapy settings --get BOT_NAME
scrapybot
$ scrapy settings --get DOWNLOAD_DELAY
0
Runspider
    • Syntax: scrapy runspider <spider_file.py>
    • Requires project: no

Runs a spider self-contained in a Python file, without having to create a project.

Example:

$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]
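A minimal self-contained myspider.py that could be run this way might look like the following sketch (the URL and field are illustrative):

# myspider.py - no project required; run with: scrapy runspider myspider.py
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # extract the page title as a simple demonstration
        yield {'title': response.css('title::text').get()}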
Version
    • Syntax: scrapy version [-v]
    • Requires project: no

Outputs the Scrapy version. When run with -v, it also outputs Python, Twisted and platform information, which is useful for bug reports.

Deploy

New in version 0.11.

    • Syntax: scrapy deploy [ <target:project> | -l <target> | -L ]
    • Requires project: yes

Deploys the project to a Scrapyd server. See the Scrapyd documentation on deploying your project for details.

Bench

New in version 0.17.

    • Syntax: scrapy bench
    • Requires project: no

Runs a quick benchmark test.

Custom Project Commands

You can also add your own project commands via the COMMANDS_MODULE setting. See the built-in Scrapy commands in scrapy/commands for examples of how to implement your own.

COMMANDS_MODULE

Default: '' (empty string)

The module to use for looking up custom Scrapy commands.

Example:

COMMANDS_MODULE = 'mybot.commands'
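As a sketch of how such a module might be laid out (the hello command, its module path and its behaviour are assumptions for illustration, not an official recipe), a custom command could look like this:

# mybot/commands/hello.py - exposed as "scrapy hello" once
# COMMANDS_MODULE = 'mybot.commands' is set in the project settings
from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):
    # make the command available only inside a Scrapy project
    requires_project = True

    def short_desc(self):
        return 'Print the spiders defined in this project'

    def run(self, args, opts):
        # the command framework provides a crawler process with a spider loader
        for name in sorted(self.crawler_process.spider_loader.list()):
            print(name)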
