The previous experiments and examples have all used a single spider. A real crawler project, however, almost always contains more than one. That raises two questions: 1. How do you create multiple crawlers in the same project? 2. How do you run them all once you have several?
Description: this article builds on the previous articles and experiments. If you missed them, or have questions, you can review them here:
Installing the Python crawler framework Scrapy: the pits I stepped in, and thoughts beyond programming
Scrapy crawler growth diary: creating a project, extracting data, and saving it in JSON format
Scrapy crawler growth diary: writing the crawled content to a MySQL database
How to keep your Scrapy crawler from getting banned
First, create the spiders
1. Create additional spiders with scrapy genspider <spidername> <domain>
scrapy genspider cnblogshomespider cnblogs.com
The above command creates a spider named cnblogshomespider whose start_urls is http://www.cnblogs.com/.
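For reference, the file that scrapy genspider generates (spiders/cnblogshomespider.py) should look roughly like the sketch below. The exact template varies slightly between Scrapy versions, and the class name here is an assumption derived from the spider name:

# spiders/cnblogshomespider.py -- roughly what the default genspider template produces
import scrapy

class CnblogshomespiderSpider(scrapy.Spider):
    name = "cnblogshomespider"
    allowed_domains = ["cnblogs.com"]
    start_urls = ["http://www.cnblogs.com/"]

    def parse(self, response):
        # fill in your extraction logic here
        pass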
2. List the crawlers in the project with scrapy list
scrapy list
cnblogshomespider
cnblogsspider
So we can see that there are two spiders in the project, one named cnblogshomespider and the other named cnblogsspider.
For more information about Scrapy commands, refer to: http://doc.scrapy.org/en/latest/topics/commands.html
Second, run several spiders at the same time
Now that we have two spiders in the project, how can both run at the same time? You might suggest writing a shell script that calls them one by one, or a small Python script that runs them in sequence (a naive sketch of that follows below). But when I looked on stackoverflow.com I found that others had already been down this road, and in fact the official documentation covers it directly.
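For completeness, here is a minimal sketch of that naive sequential approach: it simply shells out to scrapy crawl once per spider, one after another, so the spiders do not actually run concurrently. The spider names are the two from this project; anything smarter than looping is left out.

import subprocess

# Naive approach: invoke "scrapy crawl <name>" once per spider, sequentially.
# Each call blocks until that spider finishes before the next one starts.
for name in ["cnblogsspider", "cnblogshomespider"]:
    subprocess.check_call(["scrapy", "crawl", name])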
1. Run Scrapy from a script
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished

This mainly uses scrapy.crawler.CrawlerProcess to run a spider from within a script. More examples can be found here: https://github.com/scrapinghub/testspiders
2. Running multiple spiders in the same process
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run()  # the script will block here until all crawling jobs are finished
- Run the spiders sequentially by chaining the deferreds via CrawlerRunner

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished

These are the approaches the official documentation offers for running spiders from a script.
Third, run the spiders through a custom Scrapy command
To create a custom project command, refer to: http://doc.scrapy.org/en/master/topics/commands.html?highlight=commands_module#custom-project-commands
1. Create a commands directory
mkdir commands
Note: the commands directory must be a sibling of the spiders directory (see the layout sketch below).
2. Add a crawlall.py file under the commands directory
The idea is to adapt the built-in scrapy crawl command so that it runs all of the project's spiders at once. The source of the crawl command can be viewed here: https://github.com/scrapy/scrapy/blob/master/scrapy/commands/crawl.py
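Assuming the project is named cnblogs, as in the earlier posts, the relevant part of the project tree would look roughly like this (the spider file names are only assumptions based on the spider names):

cnblogs/
    scrapy.cfg
    cnblogs/
        __init__.py
        settings.py
        items.py
        pipelines.py
        commands/           # new directory, sibling of spiders/
            __init__.py
            crawlall.py
        spiders/
            __init__.py
            cnblogsspider.py
            cnblogshomespider.py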
from scrapy.commands import ScrapyCommand
from scrapy.crawler import CrawlerRunner
from scrapy.exceptions import UsageError
from scrapy.utils.conf import arglist_to_dict

class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                          help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE",
                          help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT",
                          help="format to use for dumping items with -o")

    def process_options(self, args, opts):
        ScrapyCommand.process_options(self, args, opts)
        try:
            opts.spargs = arglist_to_dict(opts.spargs)
        except ValueError:
            raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)

    def run(self, args, opts):
        # settings = get_project_settings()
        spider_loader = self.crawler_process.spider_loader
        for spidername in args or spider_loader.list():
            print("*********cralall spidername************" + spidername)
            self.crawler_process.crawl(spidername, **opts.spargs)
        self.crawler_process.start()

This mainly uses the self.crawler_process.spider_loader.list() method to get all of the spiders in the project, and then calls self.crawler_process.crawl() for each of them before starting the crawler process.
3. Add an __init__.py file under the commands directory
touch __init__.py
Note: this step must not be omitted; it is the problem that cost me a whole day of fiddling. Rather embarrassing, I can only blame my own half-baked background.
If it is omitted, commands is not a Python package, Scrapy's walk_modules() cannot import it, and the following exception is raised:
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 9, in <module>
    load_entry_point('Scrapy==1.0.0rc2', 'console_scripts', 'scrapy')()
  File "/usr/local/lib/python2.7/site-packages/scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 122, in execute
    cmds = _get_commands_dict(settings, inproject)
  File "/usr/local/lib/python2.7/site-packages/scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 50, in _get_commands_dict
    cmds.update(_get_commands_from_module(cmds_module, inproject))
  File "/usr/local/lib/python2.7/site-packages/scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 29, in _get_commands_from_module
    for cmd in _iter_command_classes(module):
  File "/usr/local/lib/python2.7/site-packages/scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 20, in _iter_command_classes
    for module in walk_modules(module_name):
  File "/usr/local/lib/python2.7/site-packages/scrapy-1.0.0rc2-py2.7.egg/scrapy/utils/misc.py", line 63, in walk_modules
    mod = import_module(path)
  File "/usr/local/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
ImportError: No module named commands
I could not find the cause at first, and it ate up a whole day until some helpful users on http://stackoverflow.com/ pointed it out. Thanks once again to the almighty Internet; how much nicer it would be without the Great Firewall! But I digress, back to the topic.
4. Create a setup.py in the same directory as settings.py (removing this step did not seem to make any difference for me; I am not sure why the official documentation describes it this way).

from setuptools import setup, find_packages

setup(name='scrapy-mymodule',
      entry_points={
          'scrapy.commands': [
              'crawlall=cnblogs.commands:crawlall',
          ],
      },
      )

This file defines a crawlall command: cnblogs.commands is the package that holds the command file, and crawlall is the command name.
5. Add the configuration in settings.py:
COMMANDS_MODULE = 'cnblogs.commands'
6. Run the command: scrapy crawlall
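If everything is wired up correctly, running the new command from the project root should first print one marker line per spider (from the print statement in run() above) and then start all of them in the same process. The output should look roughly like this, though the order of the spiders may differ:

scrapy crawlall
*********cralall spidername************cnblogshomespider
*********cralall spidername************cnblogsspider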
Finally, the complete source has been updated here: https://github.com/jackgitgz/CnblogsSpider