A Salted Fish's Python Crawler Path (5): The Scrapy Crawler Framework
Introduction to scrapy crawler framework
Installation: pip install scrapy. Since I use Anaconda, I installed it with conda install scrapy instead.
1. The Engine obtains a crawling Request from the Spider.
2. The Engine forwards the Request to the Scheduler for scheduling.
3. The Engine obtains the next Request to be crawled from the Scheduler.
4. The Engine sends the Request to the Downloader through the middleware.
5. After crawling the webpage, the Downloader generates a Response and sends it to the Engine through the middleware.
6. The Engine sends the received Response to the Spider through the middleware for processing.
7. After processing the Response, the Spider produces scraped Items and new crawling Requests and sends them to the Engine.
8. The Engine sends the scraped Items to the Item Pipeline (framework exit).
9. The Engine sends the new crawling Requests to the Scheduler.
The Engine keeps obtaining crawling Requests from the Scheduler and controls the data flow between all modules without interruption, until no Requests remain.
Framework entry: the Spider's initial crawling Requests
Framework exit: the Item Pipeline
Engine
(1) control data streams between all modules
(2) trigger events based on conditions
No user modification required
Downloader
Download webpages according to the Requests
No user modification required
Scheduler
Schedule and manage all crawling requests
No user modification required
Downloader Middleware
Purpose: to implement user-configurable control between the Engine, the Scheduler, and the Downloader
Function: modify, discard, or add Requests or Responses
Users may write configuration code
Spider
(1) parse the Responses returned by the Downloader
(2) generate scraped Items
(3) generate additional crawling Requests
Users need to write configuration code
Item Pipelines
(1) process the Items scraped by the Spider, in pipeline fashion
(2) consists of a sequence of operations, similar to a pipeline, where each operation is an Item Pipeline class
(3) possible operations include cleaning, validating, and de-duplicating the HTML data in scraped Items, and storing the data in a database
Users need to write configuration code
After learning about the basic concepts, let's start writing the first scrapy crawler.
Create a crawler project: scrapy startproject xxx (xxx is the project name).
This crawler simply crawls the title and author of a novel website.
We have now created a crawler project named book, and we can start editing its configuration.
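For reference, scrapy startproject book generates roughly the following layout (the exact files vary a little between Scrapy versions; middlewares.py, for instance, only appears in newer releases):

book/                 <- first-level directory
    scrapy.cfg
    book/             <- second-level directory with the configuration files
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py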
The second-level book directory contains the configuration files introduced above. Before modifying those files, we create a start.py file in the first-level book directory so that the scrapy crawler can be run directly. Write the following code in the file.
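The original snippet is not reproduced in this post, so here is a minimal sketch of what start.py typically looks like, assuming the spider is named book as in the complete code at the end:

# start.py - run the spider directly instead of typing the command in a terminal
from scrapy import cmdline

cmdline.execute(['scrapy', 'crawl', 'book'])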
The first two parameters are fixed, and the third parameter is your spider name.
Next, we fill in the fields in items.py:
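The original item definition is not shown, so here is a sketch of items.py with the two fields the spider uses later (name and author, matching the complete code at the end):

# items.py - container for the scraped data
import scrapy

class BookItem(scrapy.Item):
    name = scrapy.Field()    # novel title
    author = scrapy.Field()  # author name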
Create the crawler main program book.py in the spiders directory.
The website we want to crawl is http://book.km.com/
By clicking on the different novel categories on the site, you will find that the addresses follow the pattern http://book.km.com/<category>.html.
With this pattern we can construct the requests and read the content of each category page.
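A condensed excerpt showing how the spider builds these category URLs; the category names and the zurl attribute are taken from the complete code at the end of the post:

def start_requests(self):
    D = ['jushi', 'xuanhuan']  # novel categories to crawl; add more as needed
    for i in D:
        url = f'{self.zurl}{i}.html'  # e.g. http://book.km.com/xuanhuan.html
        yield scrapy.Request(url, callback=self.parse)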
Once the response is obtained, we use the parse function to parse the webpage and extract the required information.
The page analysis and data extraction are done with the BeautifulSoup library; the details are omitted here, so work through the analysis yourself ~
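For reference, the extraction in the complete code at the end boils down to the following; the dl/info structure and the a and span tags are specific to this site's HTML:

def parse(self, response):
    soup = BeautifulSoup(response.text, 'lxml')
    for entry in soup.find_all('dl', class_='info'):  # one <dl class="info"> block per novel
        item = BookItem()
        item['name'] = entry.a.string          # the title sits in the first <a> tag
        item['author'] = entry.dd.span.string  # the author sits in <dd><span>
        yield item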
To store the crawled information, you need to edit pipelines.py.
Two storage methods are provided here (sketched below):
1. Save the data as a txt text file
2. Store the data in a database
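The original pipeline code is not reproduced here, so below is a sketch of pipelines.py under some assumptions: the txt pipeline appends one line per item to book.txt, and the database pipeline uses sqlite3 as a stand-in because the post does not say which database was used. Class names and file names are illustrative.

# pipelines.py - two alternative storage pipelines (sketch)
import sqlite3

class TxtPipeline(object):
    """Append each item as one line of text to book.txt."""
    def process_item(self, item, spider):
        with open('book.txt', 'a', encoding='utf-8') as f:
            f.write(f"{item['name']}\t{item['author']}\n")
        return item

class DBPipeline(object):
    """Store items in a local sqlite3 database (stand-in for whatever database the original used)."""
    def open_spider(self, spider):
        self.conn = sqlite3.connect('book.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS book (name TEXT, author TEXT)')

    def process_item(self, item, spider):
        self.conn.execute(
            'INSERT INTO book (name, author) VALUES (?, ?)',
            (item['name'], item['author']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()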
To make this run successfully, we also need to configure settings.py:
ITEM_PIPELINES = {'book.pipelines.xxx': 300,}
Here xxx is the class name of the storage pipeline; set it to whichever class you want to use. The crawl results are omitted, since there is nothing special to look at.
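For example, with the hypothetical TxtPipeline class from the sketch above, the line would read:
ITEM_PIPELINES = {'book.pipelines.TxtPipeline': 300,}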
That is it for the first crawler framework. With the end of term approaching, there is no time to keep improving this crawler; later, when there is time, I will extend it into a program that also crawls the novel contents, and so on.
The complete code of book.py:
import scrapy
from bs4 import BeautifulSoup
from book.items import BookItem

class Bookspider(scrapy.Spider):
    name = 'book'  # spider name
    allowed_domains = ['book.km.com']  # list of allowed domains
    zurl = 'http://book.km.com/'  # base URL of the novel site

    def start_requests(self):
        D = ['jushi', 'xuanhuan']  # novel categories; only two are listed here, add more yourself
        for i in D:  # build and yield a request for each category page
            url = f'{self.zurl}{i}.html'
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        imf = BeautifulSoup(response.text, 'lxml')
        b = imf.find_all('dl', class_='info')
        for i in b:
            bookname = i.a.string
            author = i.dd.span.string
            item = BookItem()
            item['name'] = bookname
            item['author'] = author
            yield item