Mrzont · Published on August 29, 2015
0. Basic Environment Description
This article was written on Windows 8 (yes, I went home for a holiday and the home machine runs Win 8, nothing to be done about that), but the steps are basically the same on Windows 7. PS: I later switched to a Win 7 machine, so this article is really a mix of Win 8 and Win 7, haha~ Not that it makes any real difference.
The Scrapy version used is 1.0.3.
This article is basically a rather shameless translation of the official documentation, with a bit of my own understanding added along the way.
References and download links:
If you find it useful, please recommend and bookmark it, so I know that what I write is helping someone, haha~
My operating system is 64-bit, but I use 32-bit Python, because the 32-bit packages are more complete. I have since migrated to a 64-bit server; if any problems turn up I will add notes here in time.
1. Scrapy shell commands in detail
1.1 Command overview
You can first list all the available Scrapy commands with the following command:
scrapy -h
Scrapy's commands fall into two categories, project commands and global commands, 14 in total (yes, I counted them carefully, twice). The split is perfectly symmetrical: 7 global commands and 7 project commands (yes, I counted again). They are:
Global commands
startproject
settings
runspider
shell
fetch
view
version
Project commands
crawl
check
list
edit
parse
genspider
bench
Haha, now let's go through these commands one by one; some of them are quite fun to use~
1.2 Global commands
1.2.1 startproject: create a new project
We actually used this command in the previous tutorial, so it should look familiar: it creates a crawler project skeleton named project_name for us (sly grin). It is the first step of creating any crawler project.
Example:
scrapy startproject njupt  # yes, I shamelessly changed the official example; this creates a folder named njupt in the current directory, containing a Scrapy project skeleton also named njupt
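For reference, the skeleton that startproject generates looks roughly like this (exact contents may differ slightly between Scrapy versions):
njupt/
    scrapy.cfg            # config file marking the project root
    njupt/                # the project's Python module
        __init__.py
        items.py          # item definitions go here
        pipelines.py      # item pipelines go here
        settings.py       # project settings
        spiders/          # your spiders live in this directory
            __init__.py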
1.2.2 settings: view configuration values
Basic syntax: scrapy settings [options]
Does it require a project: of course not~ (that is what the official docs say, though it seems slightly questionable, see below)
This command is used to view settings values. The official docs say it needs no project, but from my testing: if you run it inside a project directory, it shows the values from that project's settings.py; if you run it outside a project directory, it seems to return Scrapy's default values. I personally recommend running this command inside the project directory to inspect what is in settings.py.
Example:
scrapy settings --get BOT_NAME
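A couple more quick sketches of the same idea (DOWNLOAD_DELAY and LOG_ENABLED are both standard Scrapy settings; the values in the comments are just the usual defaults):
scrapy settings --get DOWNLOAD_DELAY  # prints 0 unless you changed it in settings.py
scrapy settings --get LOG_ENABLED     # True by default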
1.2.3 runspider: run a standalone spider
This command runs a spider without relying on a project. Let me meekly point out, though, that a spider with no settings and no pipelines is not much of a spider~
Example:
scrapy runspider njupt.py
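For reference, a minimal sketch of what such a self-contained njupt.py could contain (the class name and URL are just placeholders; any file holding a Spider subclass will do):
import scrapy

class NjuptSpider(scrapy.Spider):
    name = "njupt"
    start_urls = ["http://www.njupt.edu.cn"]

    def parse(self, response):
        # yield the page title just to prove the spider ran
        yield {"title": response.xpath("//title/text()").extract_first()}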
1.2.4 shell: open a shell environment for debugging responses (very important!!!)
This command really matters: its main purpose is to open a shell environment for debugging a Response (yes, exactly what the heading says). Because it is so important, I have decided to come back and write a separate article just about this command; impatient readers can go read the official docs in the meantime~ (Actually I am just tired: I have been reinstalling my computer while writing this and it is already past midnight, so as agreed, off to bed; I will fill this part in later.)
Example:
scrapy shell http://www.njupt.edu.cn
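Once the shell opens, you mostly poke at the response object it hands you. A quick hedged sketch of the kind of thing you would type (the XPath is only an illustration):
>>> response.status                                   # HTTP status code of the fetched page
>>> response.xpath('//title/text()').extract_first()  # try selectors out interactively
>>> view(response)                                    # open this very response in your browser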
1.2.5 fetch: show the fetch process
This command shows, on standard output, the whole process of a crawler fetching the specified URL. One thing to note: if you use this command inside a project directory, the project's crawler configuration is used by default; if you use it outside a project directory, Scrapy's default crawler does the fetching. So it runs fine even with no project at all.
Example:
scrapy fetch http://www.njupt.edu.cn  # shows the fetch process as well as the fetched HTML content
scrapy fetch --nolog --headers http://www.njupt.edu.com/  # you can add options to inspect various things: --nolog suppresses the rather noisy log output, --headers shows the header information of the request
1.2.6 view: open a page the way Scrapy sees it
When you cannot manage to extract some piece of information, consider using view. This command downloads a page and opens it in your browser, so you can compare the page Scrapy "sees" with the page you see through your own browser; this is very useful for dynamically generated web pages! But there is a pitfall here. On my Win7 setup, when this command downloads the page and opens it in a browser (a purely command-line, GUI-less Linux box obviously will not auto-open one), any Ajax on the page gets executed again because a real browser opened it. So the page the browser shows ends up looking much like the page you normally visit, which is not actually the page Scrapy saw. How do you see the real raw page? Very simple: find the downloaded file and open it with sublime. Its path is right there in the browser's address bar~
Example:
scrapy view http://item.jd.com/1319191.html  # yes, I betrayed my dear njupt here, because our school's site is too hardcore to use any Ajax, so I used JD as the example instead
1.2.7 version: show version information
This command is simple: it shows the Scrapy version. If you add the -v option it also shows the Python, Twisted and platform information, which is apparently very helpful for finding and reporting bugs!
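Example:
scrapy version -v  # also prints the Python, Twisted and platform details mentioned above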
1.3 Project commands
1.3.1 genspider: create a spider from a template
Basic syntax: scrapy genspider [-t template] <name> <domain>
Does it require a project: it is a project command, so definitely yes~
This command mainly helps when you are writing several spiders: it lets you quickly generate new spiders from existing templates. Of course it is not the only way to create a new spider; if you are not feeling too lazy, young man, you can always type one out by hand~
Example:
scrapy genspider -l  # add the -l option to list the available spider templates
scrapy genspider -d basic  # add -d plus a template name to print the contents of that template; this is more useful on Linux, whereas a Windows rookie like me just right-clicks the template file and opens it with sublime
scrapy genspider -t basic example example.com  # the exciting way to actually generate a spider
The arguments following -t are, in order: the template name, the new spider's name, and the domain the new spider is allowed to crawl. Apparently spiders are usually just named after the main domain they crawl. The sharp-eyed will also have noticed that the new spider's name and its allowed domain correspond exactly to the name and allowed_domains parameters mentioned in the previous tutorial.
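For reference, the spider generated from the basic template looks roughly like this (exact formatting depends on the Scrapy version):
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        pass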
1.3.2 crawl: start a spider
Basic syntax: scrapy crawl <spider>
Does it require a project: it is a project command, so definitely yes~
This is the exciting one: every time a spider is finished you cannot wait to try it, and we already used it in the last tutorial. But it seems to run only a single spider; what if you want to run several? I can think of two solutions right now: 1. write a bat or shell script yourself; 2. add your own custom scrapy command (yes, I will show how to do that in a future tutorial; if you want to see it, I will not even beg you to recommend and bookmark, hmph~). A small sketch of one common variant of option 1 follows the example below.
Example:
scrapy crawl njupt  # mwahaha, start the njupt spider, young man~
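Here is that sketch: driving several spiders from a small Python script using Scrapy's documented CrawlerProcess API (run it from the project directory so get_project_settings can find settings.py; the second spider name is hypothetical):
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# load the project's settings.py so pipelines and middleware still apply
process = CrawlerProcess(get_project_settings())
process.crawl('njupt')   # spider names as registered in the project
process.crawl('njupt2')  # hypothetical second spider
process.start()          # blocks until every crawl has finished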
1.3.3 check: check spider integrity
Basic syntax: scrapy check [-l] <spider>
Does it require a project: it is a project command, so definitely yes~
The official description of this command is "contract check" and little else. Trying it out on my Win7 machine, it can catch part of the errors, but adding the -l option seemed useless: it did not show the spider list and method list from the official docs' example, just a pile of warnings. I will come back and dig through the source later. Roughly speaking, this command catches things like syntax, import and warning-level errors; logic errors it will certainly not find~
Example:
scrapy check njupt
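For reference, what check mainly exercises are "contracts", small annotations written in a spider callback's docstring; a minimal hedged sketch (the URL and the scraped field are placeholders):
import scrapy

class NjuptSpider(scrapy.Spider):
    name = "njupt"

    def parse(self, response):
        """A docstring contract that scrapy check can verify.

        @url http://www.njupt.edu.cn
        @returns items 1
        @scrapes title
        """
        yield {"title": response.xpath("//title/text()").extract_first()}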
1.3.4 list: list the spiders in the project
This command simply shows which spiders the current project contains. Once you have written a lot of spiders, running it feels like reviewing the troops~ It generally gets more use on Linux~
Example:
scrapy list
1.3.5 edit: edit a spider
Typical of those classy Linux commands: type it and the spider is instantly opened for editing in the editor specified in settings.py (yes, I was shocked that settings.py can even carry a parameter this cool). By the way, on my Win7 machine it just errors out when run... sad.
Example:
scrapy edit njupt
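If you want to control which editor gets launched, the relevant knob is the EDITOR setting (a hedged sketch; vim is just an example, and Scrapy otherwise falls back to the EDITOR environment variable):
EDITOR = 'vim'  # in settings.py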
1.3.6 parse: fetch a URL and parse it with your spider
Basic syntax: scrapy parse <url> [options]
Does it require a project: it is a project command, so definitely yes~
This command is well suited to testing your own spider, and later to testing how it works in combination with things like pipelines. I mostly use it to test my spiders (before I discovered this command I had always been testing with the crawl command... tragic).
The supported options are quite rich:
--spider=SPIDER: force a specific spider to be used instead of the one resolved automatically from the URL
-a NAME=VALUE: set an argument required by the spider; can be given multiple times
--callback or -c: the spider method used to process the response; parse is used by default if nothing is specified
--pipelines: specify the pipelines to process items through; these can be customized flexibly~
--rules or -r: use the rules defined on a CrawlSpider to pick the callback that parses the response
--noitems: do not display the scraped items
--nolinks: do not display the extracted links
--nocolour: do not highlight the output (this option is not very useful)
--depth or -d: set the crawl depth, default is 1~
--verbose or -v: show information for each depth level crawled
Example:
scrapy parse http://www.njupt.edu.cn
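A slightly richer hedged example combining a few of the options above (parse_item is a hypothetical callback name; adjust it to whatever your spider actually defines):
scrapy parse --spider=njupt -c parse_item -d 2 --noitems http://www.njupt.edu.cn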
1.3.7 bench: hardware benchmark
My personal understanding of this command is that it runs a crawler stress test against your hardware, to see how fast your hardware could drive a crawler if the network were not a factor. Of course that is a somewhat theoretical speed; in practice you will never crawl that fast. Treat it simply as a way of finding your hardware bottleneck. On my Win7 machine, though, running it seemed to do nothing: the various figures promised by the official docs never appeared. I will study it carefully when I have time.
Example:
scrapy bench
Finally done. Please bookmark and recommend, so that I have more motivation to write new tutorials, haha~
Original address: 1190000003509661