The previous experiments and examples have all used a single spider. A real crawler project, however, almost always contains more than one. That raises two questions: 1. How do you create multiple crawlers in the same project? 2. How do you run them all once you have several?
Description: this article builds on the previous articles and experiments. If you missed them, or have questions, you can review them here:
Installing the Python crawler framework Scrapy: the pits I stepped in, and thoughts beyond programming
Scrapy crawler growth diary: creating a project, extracting data, and saving it in JSON format
Scrapy crawler growth diary: writing the crawled content to a MySQL database
How to keep your Scrapy crawler from getting banned
First, create the spiders
1. Create additional spiders with scrapy genspider <spidername> <domain>
scrapy genspider cnblogshomespider cnblogs.com
The above command creates a spider named cnblogshomespider whose start_urls is http://www.cnblogs.com/.
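For reference, the file that scrapy genspider generates (spiders/cnblogshomespider.py) should look roughly like the sketch below. The exact template varies slightly between Scrapy versions, and the class name here is an assumption derived from the spider name:

# spiders/cnblogshomespider.py -- roughly what the default genspider template produces
import scrapy

class CnblogshomespiderSpider(scrapy.Spider):
    name = "cnblogshomespider"
    allowed_domains = ["cnblogs.com"]
    start_urls = ["http://www.cnblogs.com/"]

    def parse(self, response):
        # fill in your extraction logic here
        pass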
2. List the crawlers in the project with scrapy list
scrapy list
cnblogshomespider
cnblogsspider
So we can see that there are two spiders in the project, one named cnblogshomespider and the other named cnblogsspider.
For more information about Scrapy commands, refer to: http://doc.scrapy.org/en/latest/topics/commands.html
Second, run several spiders at the same time
Now that we have two spiders in the project, how can both run at the same time? You might suggest writing a shell script that calls them one by one, or a small Python script that runs them in sequence (a naive sketch of that follows below). But when I looked on stackoverflow.com I found that others had already been down this road, and in fact the official documentation covers it directly.
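For completeness, here is a minimal sketch of that naive sequential approach: it simply shells out to scrapy crawl once per spider, one after another, so the spiders do not actually run concurrently. The spider names are the two from this project; anything smarter than looping is left out.

import subprocess

# Naive approach: invoke "scrapy crawl <name>" once per spider, sequentially.
# Each call blocks until that spider finishes before the next one starts.
for name in ["cnblogsspider", "cnblogshomespider"]:
    subprocess.check_call(["scrapy", "crawl", name])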
1. Run Scrapy from a script
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished

This mainly uses scrapy.crawler.CrawlerProcess to run a spider from within a script. More examples can be found here: https://github.com/scrapinghub/testspiders
2. Running multiple spiders in the same process
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run()  # the script will block here until all crawling jobs are finished
- Run the spiders sequentially by chaining the deferreds via CrawlerRunner

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished

These are the approaches the official documentation offers for running spiders from a script.
Third, run the spiders through a custom Scrapy command
To create a custom project command, refer to: http://doc.scrapy.org/en/master/topics/commands.html?highlight=commands_module#custom-project-commands
1. Create a commands directory
mkdir commands
Note: the commands directory must be a sibling of the spiders directory (see the layout sketch below).
2. Add a crawlall.py file under the commands directory
The idea is to adapt the built-in scrapy crawl command so that it runs all of the project's spiders at once. The source of the crawl command can be viewed here: https://github.com/scrapy/scrapy/blob/master/scrapy/commands/crawl.py
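Assuming the project is named cnblogs, as in the earlier posts, the relevant part of the project tree would look roughly like this (the spider file names are only assumptions based on the spider names):

cnblogs/
    scrapy.cfg
    cnblogs/
        __init__.py
        settings.py
        items.py
        pipelines.py
        commands/           # new directory, sibling of spiders/
            __init__.py
            crawlall.py
        spiders/
            __init__.py
            cnblogsspider.py
            cnblogshomespider.py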
from scrapy.commands import ScrapyCommand
from scrapy.crawler import CrawlerRunner
from scrapy.exceptions import UsageError
from scrapy.utils.conf import arglist_to_dict

class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                          help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE",
                          help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT",
                          help="format to use for dumping items with -o")

    def process_options(self, args, opts):
        ScrapyCommand.process_options(self, args, opts)
        try:
            opts.spargs = arglist_to_dict(opts.spargs)
        except ValueError:
            raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)

    def run(self, args, opts):
        # settings = get_project_settings()
        spider_loader = self.crawler_process.spider_loader
        for spidername in args or spider_loader.list():
            print("*********cralall spidername************" + spidername)
            self.crawler_process.crawl(spidername, **opts.spargs)
        self.crawler_process.start()

This mainly uses the self.crawler_process.spider_loader.list() method to get all of the spiders in the project, and then calls self.crawler_process.crawl() for each of them before starting the crawler process.
3. Add an __init__.py file under the commands directory
touch __init__.py
Note: this step must not be omitted; it is the problem that cost me a whole day of fiddling. Rather embarrassing, I can only blame my own half-baked background.
If it is omitted, commands is not a Python package, Scrapy's walk_modules() cannot import it, and the following exception is raised:
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 9, in <module>
    load_entry_point('Scrapy==1.0.0rc2', 'console_scripts', 'scrapy')()
  File "/usr/local/lib/python2.7/site-packages/scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 122, in execute
    cmds = _get_commands_dict(settings, inproject)
  File "/usr/local/lib/python2.7/site-packages/scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 50, in _get_commands_dict
    cmds.update(_get_commands_from_module(cmds_module, inproject))
  File "/usr/local/lib/python2.7/site-packages/scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 29, in _get_commands_from_module
    for cmd in _iter_command_classes(module):
  File "/usr/local/lib/python2.7/site-packages/scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 20, in _iter_command_classes
    for module in walk_modules(module_name):
  File "/usr/local/lib/python2.7/site-packages/scrapy-1.0.0rc2-py2.7.egg/scrapy/utils/misc.py", line 63, in walk_modules
    mod = import_module(path)
  File "/usr/local/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
ImportError: No module named commands
I could not find the cause at first, and it ate up a whole day until some helpful users on http://stackoverflow.com/ pointed it out. Thanks once again to the almighty Internet; how much nicer it would be without the Great Firewall! But I digress, back to the topic.
4. Create a setup.py in the same directory as settings.py (removing this step did not seem to make any difference for me; I am not sure why the official documentation describes it this way).

from setuptools import setup, find_packages

setup(name='scrapy-mymodule',
      entry_points={
          'scrapy.commands': [
              'crawlall=cnblogs.commands:crawlall',
          ],
      },
      )

This file defines a crawlall command: cnblogs.commands is the package that holds the command file, and crawlall is the command name.
5. Add the configuration in settings.py:
COMMANDS_MODULE = 'cnblogs.commands'
6. Run the command: scrapy crawlall
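If everything is wired up correctly, running the new command from the project root should first print one marker line per spider (from the print statement in run() above) and then start all of them in the same process. The output should look roughly like this, though the order of the spiders may differ:

scrapy crawlall
*********cralall spidername************cnblogshomespider
*********cralall spidername************cnblogsspider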
Finally, the complete source has been updated here: https://github.com/jackgitgz/CnblogsSpider