Scrapy's shell command


    • Scrapy
    • Python

By Mrzont, published on August 29, 2015


0. Basic Environment Description
    1. The environment for this article is Win8 (yes, I was home for the holidays and the home machine runs Win8, nothing to be done about it), but the steps are basically the same on Win7. PS: I later switched computers to Win7, so this article's environment is actually a mix of Win8 and Win7 ~ wow haha ~ not that it makes any real difference ~

    2. Scrapy version is 1.0.3

    3. This article is basically a shameless translation of the official documentation, with a bit of my own understanding added along the way

    4. References and download links:

      • Official 1.0.3 English Document download

    5. If you like it, please recommend + bookmark ~ so I know someone is reading what I write, wow haha ~

    6. My operating system is 64-bit, but I use 32-bit Python, because the 32-bit packages are more complete. I recently migrated to a 64-bit server; if any problems come up I will add notes here in time.

1. Scrapy shell commands in detail
1.1 Command overview

You can first view all available Scrapy commands with the following command:

scrapy -h

Scrapy's commands fall into two categories, project commands and global commands, 14 in total (well, I seriously counted twice). The split is perfectly symmetric: 7 project commands and 7 global commands (well, I seriously counted again). They are:

Global commands

    • startproject

    • settings

    • runspider

    • shell

    • fetch

    • view

    • version

Project commands

    • crawl

    • check

    • list

    • edit

    • parse

    • genspider

    • bench

Wow haha, now let's start learning these commands; some of them you will end up using all the time OH ~

1.2 Global command analysis
1.2.1 startproject: create a project
    • Basic syntax: scrapy startproject <project_name>

    • Whether a project needs to exist: of course not.

We actually used this command in the previous tutorial, so it should not be unfamiliar: it creates a crawler project skeleton named project_name for us to play with (wretched face). It is the first step of creating a crawler project.

Case

scrapy startproject njupt # well, I shamelessly modified the official example; this creates a folder named njupt under the directory where the command is run, containing a Scrapy project skeleton also named njupt
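For reference, the generated skeleton looks roughly like this (a sketch of the default Scrapy 1.0 layout from memory; minor details may differ in your version):

njupt/
    scrapy.cfg          # deploy configuration file for the project
    njupt/              # the project's Python module
        __init__.py
        items.py        # item definitions go here
        pipelines.py    # item pipelines go here
        settings.py     # project settings
        spiders/        # directory where your spiders live
            __init__.py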
1.2.2 settings: view configuration parameters
    • Basic syntax: scrapy settings [options]

    • Whether a project needs to exist: officially no ~ (that is what the docs say, but it seems slightly questionable)

This command is used to view project settings. The official docs say no project is needed; in my tests, if you run it inside a project directory it shows the values from that project's settings.py, and if you run it outside a project directory it seems to return Scrapy's default values. I personally recommend running this command inside the project directory to inspect the settings.py values.

Case

scrapy settings --get BOT_NAME
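A couple more hedged examples (BOT_NAME and DOWNLOAD_DELAY are standard Scrapy settings; the values in the comments assume you run the command outside any project, so Scrapy's defaults apply):

scrapy settings --get BOT_NAME        # outside a project this prints the default 'scrapybot'; inside, your project's bot name
scrapy settings --get DOWNLOAD_DELAY  # prints 0 unless your project's settings.py overrides it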
1.2.3 runspider: run a standalone crawler
    • Basic syntax: scrapy runspider <spider_file.py>

    • Whether a project needs to exist: of course not.

This command lets Scrapy run a crawler from a single .py file without depending on a project. Though, weakly speaking, is a crawler with no settings and no pipelines really that useful?

Case

scrapy runspider njupt.py
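To make the example concrete, here is a minimal sketch of what such a standalone file might contain (the file name njupt.py and the XPath are just illustrative; any self-contained spider class will do):

# njupt.py -- run with: scrapy runspider njupt.py
import scrapy

class NjuptSpider(scrapy.Spider):
    name = 'njupt'
    start_urls = ['http://www.njupt.edu.cn']

    def parse(self, response):
        # yield the page title as a plain dict item
        yield {'title': response.xpath('//title/text()').extract_first()}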
1.2.4 shell: open a shell environment for debugging responses (very important!!!)
    • Basic syntax: scrapy shell [url]

    • Whether a project needs to exist: of course not.

This command is really important. Its main purpose is to open a shell environment for debugging the Response (yes, exactly what the heading says). Because it is so important, I have decided to come back and write a dedicated article just about it; anyone in a hurry can go read the official docs first ~ (actually I am just tired, because I have been reinstalling my computer while writing this and it is now past midnight; I will make it up after some sleep)

Case:

scrapy shell http://www.njupt.edu.cn
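Once the shell opens you land in an interactive Python prompt with the fetched response and a few helpers preloaded. A minimal sketch of a session (the XPath and the second URL are made up for illustration):

>>> response.status                                   # HTTP status code of the fetched page
>>> response.xpath('//title/text()').extract()        # pull out the page title with an XPath expression
>>> view(response)                                    # open the fetched response in your browser
>>> fetch('http://www.njupt.edu.cn/another-page')     # fetch a different (hypothetical) URL in the same shell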
1.2.5 fetch: show the crawl process
    • Basic syntax: scrapy fetch [url]

    • Whether a project needs to exist: it works either way.

This command shows, on standard output, the whole process of fetching the specified URL with a crawler. One thing to note: if you use this command inside a project directory, the project's crawler settings are used by default; outside a project directory, Scrapy's defaults are used for the fetch. So it runs fine with or without a project.

Case:

scrapy fetch http://www.njupt.edu.cn # shows the fetch process as well as the fetched HTML content
scrapy fetch --nolog --headers http://www.njupt.edu.com/ # you can add options to inspect various information: --nolog suppresses a lot of annoying log output, --headers shows the request headers
1.2.6 view: view page content
    • Basic syntax: scrapy view [url]

    • Whether a project needs to exist: no project needed

When you cannot manage to extract some piece of information, consider using view. This command downloads a page and opens it in your browser, so you can compare the page Scrapy "sees" with the page you see in your own browser; this is very useful for dynamically generated pages! But there is a pitfall here. In my Win7 environment, this command downloads the page and opens it with the browser (a pure command-line, GUI-less Linux obviously will not open a browser automatically). Because the browser opens it, any Ajax on the page gets executed again, so the page the browser shows may look no different from the page you normally visit, but it is not what Scrapy really sees. How do you see the real downloaded page? Very simple: find the downloaded file and open it with Sublime. Its path is right there in the browser's address bar ~

Case:

scrapy view http://item.jd.com/1319191.html # well, I betrayed good old NJUPT: our school's website is too awesome to use Ajax, so I used JD for the example instead.
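If you would rather skip the browser entirely, a hedged alternative is to combine the fetch command described above with ordinary shell redirection and open the resulting file in your editor (raw_page.html is just a name I made up):

scrapy fetch --nolog http://item.jd.com/1319191.html > raw_page.html # this file is the body exactly as Scrapy downloaded it, before any browser-side JavaScript runs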
1.2.7 version: show version information
    • Basic syntax: scrapy version [-v]

    • Whether a project needs to exist: no project needed

This command is simple: it shows Scrapy's version. If you add the -v option it also shows the Python, Twisted and platform information, which is apparently very helpful for hunting down and reporting bugs!
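For completeness, a quick case in the same style as the other sections:

Case:

scrapy version -v # also prints the Python, Twisted and platform details mentioned above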

1.3 Project command analysis
1.3.1 genspider: create a crawler from a template
    • Basic syntax: scrapy genspider [-t template] <name> <domain>

    • Whether a project needs to exist: it is a project command, so it definitely needs one ~

This command mainly helps when you are writing several crawlers: it uses an existing template to quickly generate a new one. Of course this is not the only way to create a new crawler; young man, if you are not too tired you can always type one out by hand ~

Case:

scrapy genspider -l

Adding the -l option lists the existing crawler templates.

scrapy genspider -d basic 

Adding the -d option plus a template name prints the contents of that template. This is more useful in a Linux environment; a Windows rookie like me just right-clicks the file and opens it with Sublime to look at the contents instead.

scrapy genspider -t basic example example.com

This is the exciting way to generate a crawler: the arguments following -t are, in order, the template name, the new crawler's name, and the domain the new crawler is allowed to crawl. The crawler name is usually just the main part of the domain ~ and smart students will have realised that the new crawler's name and allowed domain correspond exactly to the name and allowed_domains parameters mentioned in the previous tutorial.
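For reference, the basic template produces a skeleton roughly like the following (a sketch from memory; the exact boilerplate generated by your Scrapy version may differ slightly):

# example.py, created under the project's spiders/ directory
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        pass  # fill in your extraction logic here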

1.3.2 crawl: launch a crawler
    • Basic syntax: scrapy crawl <spider>

    • Whether a project needs to exist: it is a project command, so it definitely needs one ~

This command is very exciting: everyone can't wait to try it as soon as a crawler is finished, and we already tested it in the previous tutorial. But it seems to run only a single crawler at a time; what if you want to run several? Two solutions come to mind: 1. write a bat or shell script yourself; 2. add your own custom scrapy command (yes, I will show you how in a future tutorial; if you want to see it, hmph ~, I won't even ask you to recommend and bookmark ~)

Case:

scrapy crawl njupt # heh haha, launch the njupt crawler, young man ~
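One option worth knowing about here (-o is Scrapy's standard feed-export switch; items.json is just a file name I made up):

scrapy crawl njupt -o items.json # run the njupt crawler and write every scraped item to items.json via the built-in feed exporter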
1.3.3 check: check crawler integrity
    • Basic syntax: scrapy check [-l] <spider>

    • Whether a project needs to exist: it is a project command, so it definitely needs one ~

The official docs describe this as a contract check and leave it at that. Trying it on Win7, it did catch some of the errors, but adding the -l option seemed to do nothing for me: it did not show the spider list and function list from the official docs' example, just a pile of warnings. I will come back and study the source code later; as far as I can tell this command checks things like syntax errors, import errors and warnings, and certainly cannot find logic errors ~

Case:

scrapy check njupt
1.3.4 list: view the crawler list
    • Basic syntax: scrapy list

    • Whether a project needs to exist: it is a project command, so it definitely needs one ~

This command lists the crawlers in the current project ~. Once you have written a lot of crawlers, running it feels like reviewing a parade ~. It is generally used more in Linux environments ~

Case:

scrapy list
1.3.5 edit: edit a crawler
    • Basic syntax: scrapy edit <spider>

    • Whether a project needs to exist: it is a project command, so it definitely needs one ~

A typically fancy command on Linux: run it and the specified crawler instantly opens for editing in the editor configured in settings.py (yes, I was shocked that settings.py can even carry such a cool parameter). By the way, on my Win7 system it just errors out when run... sad.

Case:

scrapy edit njupt
1.3.6 parse
    • Basic syntax: scrapy parse <url> [options]

    • Whether a project needs to exist: it is a project command, so it definitely needs one ~

This command is suited to testing your own spider together with the components that follow it, such as pipelines. I generally use it to test my own spiders (before I discovered this command I had always been testing with the crawl command... tragic).

The supported parameters are quite rich:

    • --spider=SPIDER: bypasses the automatic spider lookup and forces the use of the specified spider

    • --a NAME=VALUE: sets an argument required by the spider; may be given multiple times

    • --callback or -c: the spider method to use as the callback for parsing the response; parse is used by default if not specified

    • --pipelines: also run the items through the pipelines; nice and flexible OH ~

    • --rules or -r: use the CrawlSpider rules to pick the function that will serve as the callback for parsing the response

    • --noitems: do not show the scraped items

    • --nolinks: do not show the extracted links

    • --nocolour: do not colourize the output (I don't find this option very useful)

    • --depth or -d: set the crawl depth; the default is 1 Oh ~

    • --verbose or -v: display information for each depth level that is crawled

Case:

scrapy parse http://www.njupt.edu.cn
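A slightly richer, hedged example combining several of the options above (the spider name njupt and the callback parse_item are assumptions; substitute your own):

scrapy parse --spider=njupt -c parse_item -d 2 -v http://www.njupt.edu.cn # force the njupt spider, use its parse_item callback, follow links two levels deep, and show what was scraped at each level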
1.3.7 bench: hardware benchmark command
    • Basic syntax: scrapy bench

    • Whether a project needs to exist: not required

My personal understanding is that this command runs a crawler stress test against your hardware, to see how fast your hardware could drive a crawler if the network were not a factor. Of course this is a somewhat theoretical speed; in practice you will never crawl that fast, so just treat it as a way of finding your hardware bottleneck. On my Win7 machine it did not seem to do much and never printed the various figures promised by the official docs; I will study it carefully when I have time.

Case:

scrapy bench

I have finally finished! If you liked it, bookmark + recommend it, so I will have more motivation to write new tutorials, wow haha ~

Original address: 1190000003509661
