Scrapy is a fast, high-level screen scraping and web crawling framework, written in Python, for crawling websites and extracting structured data from their pages. One of its most attractive qualities is that anyone can easily modify it to suit their needs.
MongoDB is a very popular open-source non-relational (NoSQL) database. It stores data as key-value documents, which gives it great advantages with large data volumes, high concurrency, and weak-transaction workloads.
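To make the key-value point concrete, here is a minimal sketch in plain Python of the kind of schemaless document MongoDB stores. The field names are hypothetical and simply foreshadow the item fields defined later in this tutorial:

```python
# MongoDB stores records as documents: free-form sets of key-value pairs.
# A hypothetical chapter record of the kind this tutorial will produce:
chapter = {
    "bookName": "The Grave Robbers' Chronicles",
    "chapterNum": "Chapter 1",
    "chapterURL": "http://www.daomubiji.com/",
}

# Documents in the same collection need not share a schema; a new key can
# simply be added to an individual document.
chapter_with_title = dict(chapter, chapterName="Bloody Corpse")

print(sorted(chapter_with_title))
```

Because there is no fixed table definition, records with different field sets can coexist in one collection, which suits the loosely structured pages a crawler produces.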
What sparks fly when Scrapy and MongoDB collide? Let's find out with a simple test crawl of a novel site.
1. Installing Scrapy
```
pip install scrapy
```
2. Download and install MongoDB and Mongovue visualizations
[MongoDB](https://www.mongodb.org/)
The download and installation steps are skipped here; just create a data folder under the bin directory to hold the database files.
[MongoVue](http://www.mongovue.com/)
After the installation is complete, we need to create a database.
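The database can be created in MongoVue, or from the mongo shell as sketched below; the names must match the `MONGODB_DBNAME` and `MONGODB_DOCNAME` values used in the settings later on:

```
use zzl
db.createCollection("Book")
```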
3. Create a Scrapy project
```
scrapy startproject novelspider
```
Directory structure: novspider.py has to be created manually (the contrlodb file is not needed and can be ignored).
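For reference, a freshly generated Scrapy project typically has the following layout (the spider file is the one added by hand):

```
novelspider/
    scrapy.cfg
    novelspider/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            novspider.py   # created manually
```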
4. Writing code
Target website: http://www.daomubiji.com/
settings.py
```python
BOT_NAME = 'novelspider'

SPIDER_MODULES = ['novelspider.spiders']
NEWSPIDER_MODULE = 'novelspider.spiders'
ITEM_PIPELINES = ['novelspider.pipelines.NovelspiderPipeline']  # enable the pipeline from pipelines.py

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0'
COOKIES_ENABLED = True

MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'zzl'    # database name
MONGODB_DOCNAME = 'Book'  # collection (table) name
```
pipelines.py
```python
import pymongo
from scrapy.conf import settings


class NovelspiderPipeline(object):
    def __init__(self):
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        dbName = settings['MONGODB_DBNAME']
        client = pymongo.MongoClient(host=host, port=port)
        tdb = client[dbName]
        # the collection every crawled item is written to
        self.post = tdb[settings['MONGODB_DOCNAME']]

    def process_item(self, item, spider):
        bookInfo = dict(item)  # an Item behaves like a dict
        self.post.insert(bookInfo)
        return item
```
items.py
```python
from scrapy import Item, Field


class NovelspiderItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    bookName = Field()
    bookTitle = Field()
    chapterNum = Field()
    chapterName = Field()
    chapterURL = Field()
```
Create novspider.py under the Spiders directory
```python
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from novelspider.items import NovelspiderItem


class NovSpider(CrawlSpider):
    name = "novspider"
    redis_key = 'novspider:start_urls'
    start_urls = ['http://www.daomubiji.com/']

    def parse(self, response):
        selector = Selector(response)
        table = selector.xpath('//table')
        for each in table:
            bookName = each.xpath('tr/td[@colspan="3"]/center/h2/text()').extract()[0]
            content = each.xpath('tr/td/a/text()').extract()
            url = each.xpath('tr/td/a/@href').extract()
            for i in range(len(url)):
                item = NovelspiderItem()
                item['bookName'] = bookName
                item['chapterURL'] = url[i]
                try:
                    # link text looks like "volume chapter-number title"
                    item['bookTitle'] = content[i].split(' ')[0]
                    item['chapterNum'] = content[i].split(' ')[1]
                except Exception:
                    continue
                try:
                    item['chapterName'] = content[i].split(' ')[2]
                except Exception:
                    # no separate title: fall back to the last three characters
                    item['chapterName'] = content[i].split(' ')[1][-3:]
                yield item
```
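The split-based parsing in the spider assumes each chapter link's text is space-separated as "volume chapter-number title". A minimal sketch with made-up link text (not actual site data) shows how the fields come apart:

```python
# Hypothetical link text; assumed format: "volume chapter-number title"
sample = "Volume-1 Chapter-1 Bloody-Corpse"  # illustrative only

parts = sample.split(' ')
book_title = parts[0]    # -> volume name
chapter_num = parts[1]   # -> chapter number
chapter_name = parts[2]  # -> chapter title

print(book_title, chapter_num, chapter_name)
```

If a link text has fewer than three space-separated pieces, indexing `parts[2]` raises `IndexError`, which is why the spider wraps each access in try/except and falls back accordingly.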
5. Start the project with the command:

```
scrapy crawl novspider
```
Fetch results
Scrapy and MongoDB applications