Scrapy is a fast, high-level screen scraping and web crawling framework written in Python, used to crawl web sites and extract structured data from their pages. Its most appealing trait is that anyone can easily modify it to suit their needs.
MongoDB is a very popular open-source non-relational (NoSQL) database. It stores data as key-value documents, which gives it a great advantage with large data volumes, high concurrency, and workloads that only need weak transactions.
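To make the key-value storage concrete, here is what one stored chapter record from this walkthrough might look like as a MongoDB-style document. The field names anticipate the item class defined later; the values are made-up samples, not actual site data:

```python
import json

# A made-up sample of the document shape this tutorial ends up storing.
# MongoDB documents are schemaless, JSON-like key-value records.
chapter_doc = {
    "bookName": "盗墓笔记",
    "bookTitle": "七星鲁王",
    "chapterNum": "第一章",
    "chapterName": "血尸",
    "chapterURL": "http://www.daomubiji.com/",
}

print(json.dumps(chapter_doc, ensure_ascii=False, indent=2))
```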
What kind of spark flies when Scrapy collides with MongoDB? Let's find out with a simple crawl of a novel site.
1. Installing Scrapy
pip install scrapy
2. Download and install MongoDB and Mongovue visualizations
[MongoDB](https://www.mongodb.org/)
The download and installation steps are skipped here; just create a data folder under the bin directory to hold the database files.
[MongoVue](http://www.mongovue.com/)
After the installation is complete, we need to create a database.
3. Create a Scrapy project
scrapy startproject novelspider
Directory structure: novspider.py must be created manually (contrlodb can be ignored).
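For reference, `scrapy startproject novelspider` generates a layout like the following; novspider.py is the one file added by hand:

```
novelspider/
├── scrapy.cfg
└── novelspider/
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── novspider.py   # created manually
```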
4. Writing code
Target website:http://www.daomubiji.com/
settings.py
    BOT_NAME = 'novelspider'
    SPIDER_MODULES = ['novelspider.spiders']
    NEWSPIDER_MODULE = 'novelspider.spiders'
    ITEM_PIPELINES = ['novelspider.pipelines.NovelspiderPipeline']  # the pipeline class defined in pipelines.py
    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0'
    COOKIES_ENABLED = True
    MONGODB_HOST = '127.0.0.1'
    MONGODB_PORT = 27017
    MONGODB_DBNAME = 'zzl'    # database name
    MONGODB_DOCNAME = 'Book'  # collection (table) name
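Note that the list form of ITEM_PIPELINES used in this tutorial is the old (pre-0.24) Scrapy syntax; in current Scrapy versions the setting must be a dict mapping the pipeline's import path to an order value between 0 and 1000. A sketch of the modern equivalent:

```python
# Modern Scrapy syntax: the value (300 here) is the pipeline's run order (0-1000);
# lower values run first when several pipelines are enabled.
ITEM_PIPELINES = {
    'novelspider.pipelines.NovelspiderPipeline': 300,
}
```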
pipelines.py
    from scrapy.conf import settings
    import pymongo


    class NovelspiderPipeline(object):
        def __init__(self):
            host = settings['MONGODB_HOST']
            port = settings['MONGODB_PORT']
            dbName = settings['MONGODB_DBNAME']
            client = pymongo.MongoClient(host=host, port=port)
            tdb = client[dbName]
            self.post = tdb[settings['MONGODB_DOCNAME']]

        def process_item(self, item, spider):
            bookInfo = dict(item)
            self.post.insert(bookInfo)
            return item
items.py
    from scrapy import Item, Field


    class NovelspiderItem(Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        bookName = Field()
        bookTitle = Field()
        chapterNum = Field()
        chapterName = Field()
        chapterURL = Field()
Create novspider.py under the Spiders directory
    from scrapy.spiders import CrawlSpider
    from scrapy.selector import Selector
    from novelspider.items import NovelspiderItem


    class NovSpider(CrawlSpider):
        name = "novspider"
        redis_key = 'novspider:start_urls'
        start_urls = ['http://www.daomubiji.com/']

        def parse(self, response):
            selector = Selector(response)
            table = selector.xpath('//table')
            for each in table:
                bookName = each.xpath('tr/td[@colspan="3"]/center/h2/text()').extract()[0]
                content = each.xpath('tr/td/a/text()').extract()
                url = each.xpath('tr/td/a/@href').extract()
                for i in range(len(url)):
                    item = NovelspiderItem()
                    item['bookName'] = bookName
                    item['chapterURL'] = url[i]
                    try:
                        item['bookTitle'] = content[i].split(' ')[0]
                        item['chapterNum'] = content[i].split(' ')[1]
                    except Exception:
                        continue
                    try:
                        item['chapterName'] = content[i].split(' ')[2]
                    except Exception:
                        item['chapterName'] = content[i].split(' ')[1][-3:]
                    yield item
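The parse() method splits each chapter link's text on spaces: index 0 is the book title, index 1 the chapter number, and index 2 the chapter name; when the third part is missing, the fallback takes the last three characters of the second part. That logic can be exercised in isolation (the sample strings below are made-up illustrations, not actual site data):

```python
def split_chapter(text):
    """Mirror the spider's try/except parsing of one link text."""
    parts = text.split(' ')
    record = {'bookTitle': parts[0], 'chapterNum': parts[1]}
    try:
        record['chapterName'] = parts[2]
    except IndexError:
        # Fallback used by the spider: last 3 characters of the second part.
        record['chapterName'] = parts[1][-3:]
    return record

# Made-up examples of the two shapes the spider handles:
print(split_chapter('七星鲁王 第一章 血尸'))
print(split_chapter('七星鲁王 第二章'))
```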
5. Start the project with the command: scrapy crawl novspider
Fetch results