Introduction to the Scrapy shell command "options"




When using the Scrapy shell to test a website, the request came back as a bad request, so the User-Agent header had to be changed before trying again.



DEBUG: Crawled (400) <GET https://www.<a-website>.com> (referer: None)






But how can it be changed?






Run the scrapy shell --help command to see its usage:






Under Options, no option for setting the User-Agent can be found.



What about Global Options? The --set/-s option there can set/override a setting.
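For example, -s takes NAME=VALUE pairs and may be repeated to override several settings at once (a hypothetical invocation; USER_AGENT and ROBOTSTXT_OBEY are standard Scrapy setting names):

... >scrapy shell -s USER_AGENT="Mozilla/5.0" -s ROBOTSTXT_OBEY=False https://www.example.com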






The USER_AGENT setting was changed with the -s option, and the website was tested again; this time the page was returned successfully (status 200):



... >scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36" https://www.<a-website>.com



2018-07-15 12:11:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.<a-website>.com> (referer: None)






--------Turn--------






Note: in fact, the -s usage was not discovered through the steps above (I had been focused only on the command-specific Options and ignored the Global Options, assuming they were useless). Instead, it was found by searching the web and reading the following article:



"Scrapy shell usage (slowly updating...)", an original post by a friend (original link not preserved).









Going further: the Scrapy source code contains more detailed information about the relevant options.



Open the C:\Python36\Lib\site-packages\scrapy\commands directory and you can see the Python files for the various built-in Scrapy commands; shell.py is the source file for the scrapy shell command.
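If Scrapy is installed somewhere else, the directory can be located programmatically instead of hard-coding the path (a small sketch; it only prints the package directory):

import os
import scrapy.commands

# print the directory holding the built-in command modules (shell.py, crawl.py, ...)
print(os.path.dirname(scrapy.commands.__file__))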









From the source you can see that a Command class is defined, inheriting from scrapy.commands.ScrapyCommand, and that three options are added in the command's add_options function (a sketch of that function follows the list):



-c: evaluate the code in the shell, print the result and exit (i.e., run a snippet of parsing code?)

--spider: use this spider

--no-redirect: do not handle HTTP 3xx status codes and print response as-is
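In shell.py those options are added roughly like this (paraphrased from Scrapy 1.x-era source; exact details may differ between versions). Note the first call, which pulls in the inherited Global Options:

class Command(ScrapyCommand):

    def add_options(self, parser):
        # inherit the Global Options (including -s) from ScrapyCommand
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-c", dest="code",
            help="evaluate the code in the shell, print the result and exit")
        parser.add_option("--spider", dest="spider",
            help="use this spider")
        parser.add_option("--no-redirect", dest="no_redirect",
            action="store_true", default=False,
            help="do not handle HTTP 3xx status codes and print response as-is")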






The -s option is not among them, so where does it come from? Look at the source of scrapy.commands.ScrapyCommand (the __init__.py file): the -s option is added in its add_options function, shown below:


 
 
def add_options(self, parser):
    """
    Populate option parse with options available for this command
    """
    group = OptionGroup(parser, "Global Options")
    group.add_option("--logfile", metavar="FILE",
        help="log file. if omitted stderr will be used")
    group.add_option("-L", "--loglevel", metavar="LEVEL", default=None,
        help="log level (default: %s)" % self.settings['LOG_LEVEL'])
    group.add_option("--nolog", action="store_true",
        help="disable logging completely")
    group.add_option("--profile", metavar="FILE", default=None,
        help="write python cProfile stats to FILE")
    group.add_option("--pidfile", metavar="FILE",
        help="write process ID to FILE")
    group.add_option("-s", "--set", action="append", default=[], metavar="NAME=VALUE",
        help="set/override setting (may be repeated)")
    group.add_option("--pdb", action="store_true", help="enable pdb on failure")

    parser.add_option_group(group)





So there it is in the source.
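As a closing detail (again based on 1.x-era source, so details may vary), the same ScrapyCommand class applies the collected -s pairs in its process_options method at 'cmdline' priority, which is why they override the project settings:

def process_options(self, args, opts):
    try:
        # each "-s NAME=VALUE" ends up in opts.set; 'cmdline' is the
        # highest settings priority in Scrapy
        self.settings.setdict(arglist_to_dict(opts.set),
                              priority='cmdline')
    except ValueError:
        raise UsageError("Invalid -s value, use -s NAME=VALUE",
                         print_help=False)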






However, while searching earlier I noticed that scrapy crawl and scrapy runspider provide a -a option that also appears to set/override something. Since the -s option already exists, why add a -a option? What is the difference between the two?



From its help text, the -a option only sets spider arguments, while -s has a much wider scope: it can override any setting, including everything in the project's settings.py. (Not fully tested.)



Parser.add_option ("-a", dest= "Spargs", action= "append", default=[], metavar= "Name=value",
Help= "set spider argument (may be repeated)")












--------Turn--------






Practice 1: scrapy shell's -c option



(env0626) D:\ws\env0626\ws>scrapy shell -c "response.xpath('//title/text()')" https://www.baidu.com



Output:



2018-07-15 13:07:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com> (referer: None)
[<Selector xpath='//title/text()' data='Baidu a bit, you know'>]
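The -c expression is evaluated as ordinary shell code, so to print just the title string rather than the SelectorList, the expression could call extract_first() (a hypothetical variation on the same command, using the Scrapy 1.x selector API):

... >scrapy shell -c "response.xpath('//title/text()').extract_first()" https://www.baidu.com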






Practice 2: using scrapy runspider's -a and -s options to modify the User-Agent request header


 
 
# -*- coding: utf-8 -*-
import scrapy


class MousiteSpider(scrapy.Spider):
    name = 'mousite'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/']

    def parse(self, response):
        # yield an item dict (matching the output below); yielding a bare
        # SelectorList is not a valid spider result
        yield {'title': response.xpath('//title/text()')}


Test result: with the -a option, no data could be fetched (HTTP 400); with the -s option it worked (HTTP 200).



-a option:



(env0626) D:\ws\env0626\ws>scrapy runspider -a user_agent="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36" mousite.py



DEBUG: Crawled (400) <GET https://www.zhihu.com/> (referer: None)



INFO: Ignoring response <400 https://www.zhihu.com/>: HTTP status code is not handled or not allowed



-s option:



(env0626) D:\ws\env0626\ws>scrapy runspider -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36" mousite.py



DEBUG: Crawled (200) <GET https://www.zhihu.com/> (referer: None)



{'title': [<Selector xpath='//title/text()' data='Zhihu - Discover the bigger world'>]}






So there is indeed a difference between the two.
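A related note (not tested in this post, but standard Scrapy behavior): the same User-Agent override can be baked into the spider itself via custom_settings, whose 'spider' priority beats the project settings.py but still loses to -s ('cmdline' priority):

import scrapy

class MousiteSpider(scrapy.Spider):
    name = 'mousite'
    start_urls = ['https://www.zhihu.com/']

    # per-spider setting overrides, applied at 'spider' priority
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36'),
    }

    def parse(self, response):
        yield {'title': response.xpath('//title/text()')}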






Note that the above tests were performed outside of a Scrapy project.







