Several ways to run multiple Scrapy crawlers simultaneously (custom Scrapy project commands)


The previous experiments and examples all had only one spider. In real crawler development, however, there is certainly more than one. This raises a couple of questions: 1. How do you create multiple crawlers in the same project? 2. How do you run them all once you have multiple crawlers?

Description: This article builds on the previous articles and experiments. If you missed them, or have doubts, you can review them here:

Installing the Python crawler framework Scrapy: pitfalls and thoughts beyond programming

Scrapy Crawler Growth Diary: creating a project - extracting data - saving data in JSON format

Scrapy Crawler Growth Diary: writing crawled content to a MySQL database

How to keep your Scrapy crawler from being banned

First, create the spiders

1. Create multiple spiders: scrapy genspider spidername domain

scrapy genspider CnblogsHomeSpider cnblogs.com

The above command creates a spider named CnblogsHomeSpider, with start_urls set to http://www.cnblogs.com/.
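For reference, here is a rough sketch of what the generated spider file contains (based on the default genspider template; the exact scaffold may differ slightly between Scrapy versions):

import scrapy

class CnblogsHomeSpider(scrapy.Spider):
    name = "CnblogsHomeSpider"
    allowed_domains = ["cnblogs.com"]
    start_urls = ["http://www.cnblogs.com/"]

    def parse(self, response):
        # extraction logic for the crawled pages goes here
        pass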

2. Check which crawlers are in the project: scrapy list

[[email protected] cnblogs]# scrapy list
CnblogsHomeSpider
CnblogsSpider

This shows that there are two spiders in my project, one named CnblogsHomeSpider and the other named CnblogsSpider.

For more information about Scrapy commands, refer to: http://doc.scrapy.org/en/latest/topics/commands.html

Second, run several spiders at the same time

Now that we have two spiders in our project, how do we get them both to run at the same time? You might suggest writing a shell script that calls them one after another, or a Python script that runs them. I also looked on stackoverflow.com, and sure enough others had already asked the same question. As it turns out, the answer is in the official documentation.

1. Run Scrapy from a script
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished

This mainly uses scrapy.crawler.CrawlerProcess to run a spider from within a script. More examples can be found here: https://github.com/scrapinghub/testspiders

2. Running multiple spiders in the same process
    • Using CrawlerProcess
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
    • Using CrawlerRunner
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until all crawling jobs are finished
    • Running spiders sequentially by chaining deferreds with CrawlerRunner
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished

These are the approaches the official documentation provides for running spiders from a script.

Third, running spiders through a custom Scrapy command

To create a custom project command, refer to: http://doc.scrapy.org/en/master/topics/commands.html?highlight=commands_module#custom-project-commands

1. Create a commands directory

mkdir commands

Note: the commands and spiders directories are siblings.
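For clarity, the project layout then looks roughly like this (assuming the project is named cnblogs, as in the source repository; files other than the commands directory are the standard Scrapy scaffold):

cnblogs/
    scrapy.cfg
    cnblogs/
        __init__.py
        items.py
        pipelines.py
        settings.py
        commands/
        spiders/
            ...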

2. Add a crawlall.py file under commands

The idea is to adapt the built-in scrapy crawl command so that it runs all spiders at once. The source code of the crawl command can be viewed here: https://github.com/scrapy/scrapy/blob/master/scrapy/commands/crawl.py

from scrapy.commands import ScrapyCommand
from scrapy.crawler import CrawlerRunner
from scrapy.exceptions import UsageError
from scrapy.utils.conf import arglist_to_dict


class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                          help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE",
                          help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT",
                          help="format to use for dumping items with -o")

    def process_options(self, args, opts):
        ScrapyCommand.process_options(self, args, opts)
        try:
            opts.spargs = arglist_to_dict(opts.spargs)
        except ValueError:
            raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)

    def run(self, args, opts):
        # settings = get_project_settings()
        spider_loader = self.crawler_process.spider_loader
        for spidername in args or spider_loader.list():
            print "*********crawlall spidername************" + spidername
            self.crawler_process.crawl(spidername, **opts.spargs)
        self.crawler_process.start()

The key point is using the self.crawler_process.spider_loader.list() method to get all the spiders in the project, and then using self.crawler_process.crawl to run each of them.

3. Add an __init__.py file under the commands directory

touch __init__.py

Note: this step must not be omitted. It is exactly the problem that cost me a whole day of frustration. Rather embarrassing; I can only blame my own half-baked background.

If it is omitted, the following exception is reported:

Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 9, in <module>
    load_entry_point('Scrapy==1.0.0rc2', 'console_scripts', 'scrapy')()
  File "/usr/local/lib/python2.7/site-packages/scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 122, in execute
    cmds = _get_commands_dict(settings, inproject)
  File "/usr/local/lib/python2.7/site-packages/scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 50, in _get_commands_dict
    cmds.update(_get_commands_from_module(cmds_module, inproject))
  File "/usr/local/lib/python2.7/site-packages/scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 29, in _get_commands_from_module
    for cmd in _iter_command_classes(module):
  File "/usr/local/lib/python2.7/site-packages/scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 20, in _iter_command_classes
    for module in walk_modules(module_name):
  File "/usr/local/lib/python2.7/site-packages/scrapy-1.0.0rc2-py2.7.egg/scrapy/utils/misc.py", line 63, in walk_modules
    mod = import_module(path)
  File "/usr/local/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
ImportError: No module named commands

I couldn't find the cause at first, and it cost me a whole day; later I got help from other users on http://stackoverflow.com/. Thanks again to the almighty Internet; how nice it would be without the wall! But I digress, back to the point.

4. Create a setup.py in the same directory as settings.py. (Removing this step did not seem to have any effect either; I am not sure what the official documentation specifically intends by writing it this way.)

from setuptools import setup, find_packages

setup(
    name='scrapy-mymodule',
    entry_points={
        'scrapy.commands': [
            'crawlall = cnblogs.commands:crawlall',
        ],
    },
)

What this file does is define a crawlall command: cnblogs.commands is the package containing the command file, and crawlall is the command name.

5. Add the configuration in settings.py:

COMMANDS_MODULE = 'cnblogs.commands'

6. Run the command: scrapy crawlall
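As a rough illustration (the exact log output depends on your spiders and settings), running the custom command from the project root prints the marker line from run() for each spider found by spider_loader.list(), and then crawls them all in one process:

scrapy crawlall
*********crawlall spidername************CnblogsHomeSpider
*********crawlall spidername************CnblogsSpider
... normal crawl logs for both spiders follow ...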

The latest source code has been pushed to: https://github.com/jackgitgz/CnblogsSpider
