Introducing the Scrapy crawler framework
Scrapy can be installed with pip install Scrapy. Since I use Anaconda, I installed it with the conda command instead.
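For reference, the two install routes look roughly like this (the conda-forge channel is the one Scrapy's own docs suggest; a plain conda install scrapy may also work depending on your configured channels):

pip install scrapy
conda install -c conda-forge scrapy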
1. The engine gets the initial crawl requests from the Spider.
2. The engine forwards the crawl requests to the Scheduler for scheduling.
3. The engine gets the next request to crawl from the Scheduler.
4. The engine sends the crawl request to the Downloader through the middleware.
5. After crawling the web page, the Downloader forms a Response and returns it to the engine through the middleware.
6. The engine sends the received Response to the Spider for processing through the middleware.
7. After processing the Response, the Spider generates scraped Items and new crawl Requests and sends them to the engine.
8. The engine sends the scraped Items to the Item Pipelines (the framework's exit).
9. The engine sends the crawl Requests to the Scheduler.
The engine controls the data flow between all modules, continuously getting crawl requests from the Scheduler until no requests remain.
Framework entry point: the Spider's initial crawl requests
Framework exit point: the Item Pipelines
Engine
(1) Controls the data flow between all modules
(2) Triggers events based on conditions
No user modification required
Downloader
Downloads web pages on request
No user modification required
Scheduler
Schedules and manages all crawl requests
No user modification required
Downloader Middleware
Purpose: implements user-configurable control between the engine, the Scheduler, and the Downloader
Function: modify, discard, or add Requests or Responses
Users may write configuration code
Spider
(1) Parses the Response returned by the Downloader
(2) Generates scraped Items
(3) Generates additional crawl Requests
Requires user-written configuration code
Item Pipelines
(1) Processes the Items produced by the Spider in a pipelined manner
(2) Consists of a sequence of operations, similar to a pipeline; each operation is an Item Pipeline class
(3) Possible operations include cleaning, validating, and checking for duplicates in the HTML data of scraped Items, and storing the data in a database
Requires user to write configuration code
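In other words, only the Spider and the Item Pipelines normally need user code. As a rough, generic sketch (the class and field names here are placeholders, not part of this tutorial):

import scrapy

class MySpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # framework entry point: the spider's initial crawl requests
        yield scrapy.Request('http://example.com/', callback=self.parse)

    def parse(self, response):
        # parse the Response, yield scraped items and/or further Requests
        yield {'title': response.css('title::text').get()}

class MyPipeline:
    def process_item(self, item, spider):
        # framework exit point: every scraped item passes through the enabled pipelines
        return item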
After understanding the basic concepts, we can start writing our first Scrapy crawler.
First, create a new crawler project: scrapy startproject xxx (where xxx is the project name).
This crawler simply crawls the title and author of a novel website.
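For this tutorial the project is named book (scrapy startproject book). The layout it generates looks roughly like this (middlewares.py is present in recent Scrapy versions):

book/                  first-level directory
    scrapy.cfg
    book/              second-level directory holding the configuration files
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py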
We have now created the crawler project book; next we edit its configuration. The files in the second-level book directory are the configuration files described above. Before modifying them, we first create a start.py in the first-level book directory so that the Scrapy crawler can be run from inside the IDE. Write the following code in that file.
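The start.py code block did not survive in this copy of the post; a minimal sketch of the usual approach, assuming the spider is named book as in the complete code at the end, is:

from scrapy import cmdline

# equivalent to typing "scrapy crawl book" on the command line
cmdline.execute(['scrapy', 'crawl', 'book'])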
The first two parameters are fixed, and the third parameter is the name of your spider.
Next we fill in the fields in items.py:
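The items code is also missing from this copy; given that the spider at the end of the post fills in 'name' and 'author' on a BookItem, items.py presumably looks roughly like this:

import scrapy

class BookItem(scrapy.Item):
    # the two fields this crawler extracts for each novel
    name = scrapy.Field()
    author = scrapy.Field()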
Then create the crawler's main program, book.py, in the spiders directory.
The site we are going to crawl is http://book.km.com/. Clicking through the different novel categories on the site, you will find that each category page's address is http://book.km.com/ plus the pinyin of the category plus .html. Using this pattern we can write the code that requests and reads the pages.
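The request-generating half of book.py (excerpted from the complete code at the end of the post) builds one URL per category pinyin and hands it to Scrapy:

    def start_requests(self):
        d = ['jushi', 'xuanhuan']          # novel category pinyin; add more if needed
        for i in d:
            url = self.zurl + i + '.html'  # e.g. http://book.km.com/jushi.html
            yield scrapy.Request(url, callback=self.parse)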
Once a page is retrieved, the parse function parses it and extracts the required information. The actual extraction is done with the BeautifulSoup library; I won't go into the details here, analyze the page yourself 2333~
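For reference, the parsing half (lightly condensed from the complete code at the end) assumes each novel sits in a dl element with class Info, with the title in its first a tag and the author in a span inside a dd:

    def parse(self, response):
        imf = BeautifulSoup(response.text, 'lxml')
        b = imf.find_all('dl', class_='Info')   # one dl per novel
        for i in b:
            item = BookItem()
            item['name'] = i.a.string           # title from the first a tag
            item['author'] = i.dd.span.string   # author from the span inside dd
            yield item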
Now that the spider is written, we need to store the crawled information, which means editing pipelines.py.
Two storage options are shown here:
1. Save as a TXT text file
2. Save into a database
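The pipeline code did not survive in this copy either. A minimal sketch of the TXT option (the class name and output file name are my own placeholders) could look like the following; a database version would keep the same process_item structure but write to a connection opened in open_spider instead:

class BookTxtPipeline:
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.file = open('book.txt', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # write one "title <tab> author" line per scraped item
        self.file.write('{}\t{}\n'.format(item['name'], item['author']))
        return item

    def close_spider(self, spider):
        self.file.close()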
To make this run we also need to enable the pipeline in settings.py:
ITEM_PIPELINES = {'book.pipelines.xxx': 300}
Here xxx is the class name of the storage method; set it to whichever pipeline class you want to use. I won't paste the run results here.
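With the placeholder TXT pipeline sketched above, for example, the setting would read:

ITEM_PIPELINES = {
    'book.pipelines.BookTxtPipeline': 300,   # lower values run first (conventionally 0-1000)
}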
That's it for this first crawler framework. Once this busy stretch is over I will keep improving the crawler, and when it can also crawl the novels' content I'll share that program too.
The complete code of book.py is attached below:
import scrapy
from bs4 import BeautifulSoup
from book.items import BookItem

class BookSpider(scrapy.Spider):
    name = 'book'                      # spider name
    allowed_domains = ['book.km.com']  # list of domains the spider is allowed to crawl
    zurl = 'http://book.km.com/'

    def start_requests(self):
        d = ['jushi', 'xuanhuan']      # novel categories; two are listed here, add more if you need them
        for i in d:                    # loop over the categories
            url = self.zurl + i + '.html'
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        imf = BeautifulSoup(response.text, 'lxml')
        b = imf.find_all('dl', class_='Info')
        for i in b:
            bookname = i.a.string
            author = i.dd.span.string
            item = BookItem()
            item['name'] = bookname
            item['author'] = author
            yield item
A salted fish's path to Python crawling (5): the Scrapy crawler framework