Python crawler path of a salted fish (5): the scrapy crawler framework

Introduction to scrapy crawler framework

Installation: pip install scrapy. Since I use Anaconda, I installed it with conda install scrapy instead.

The data flow in the scrapy framework works as follows:

1. The Engine obtains a crawling Request from the Spider.
2. The Engine forwards the request to the Scheduler for scheduling.
3. The Engine obtains the next request to crawl from the Scheduler.
4. The Engine sends the request to the Downloader through the Downloader Middleware.
5. After downloading the webpage, the Downloader returns a Response to the Engine through the middleware.
6. The Engine sends the Response to the Spider through the middleware for processing.
7. After processing the Response, the Spider produces scraped Items and new Requests and sends them to the Engine.
8. The Engine sends the scraped Items to the Item Pipeline (the framework exit).
9. The Engine sends the new requests to the Scheduler.

The Engine keeps controlling the data flow between the modules, obtaining crawling requests from the Scheduler until no requests remain.
Framework entry: the Spider's initial crawling requests.
Framework exit: the Item Pipeline.

Each module plays the following role:

Engine
(1) Controls the data flow between all modules
(2) Triggers events based on conditions
No user modification required

Downloader
Downloads webpages as requested
No user modification required

Scheduler
Schedules and manages all crawling requests
No user modification required

Downloader Middleware
Purpose: provides user-configurable control between the Engine, the Scheduler, and the Downloader
Function: modify, discard, or add requests or responses
Users can write configuration code

Spider
(1) Parses the Response returned by the Downloader
(2) Generates scraped Items
(3) Generates additional crawling Requests
You need to write the configuration code

Item Pipelines
(1) Processes the Items generated by the Spider in pipeline fashion
(2) Consists of a sequence of operations, like a pipeline; each operation is an Item Pipeline class
(3) Possible operations include cleaning, validating, and de-duplicating the HTML data in scraped Items, and storing the data in a database
You need to write the configuration code

After learning about the basic concepts, let's start writing the first scrapy crawler.

Create a crawler project: scrapy startproject xxx (where xxx is the project name).

This crawler simply crawls the title and author of a novel website.

We have now created a crawler project named book, and we can edit its configuration.
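For reference, a project created with scrapy startproject book is laid out roughly as follows (the exact files vary slightly between scrapy versions; middlewares.py, for example, only appears in newer ones):

book/
    scrapy.cfg          # deployment configuration
    book/               # the project's Python module (the second-level book directory)
        __init__.py
        items.py        # item definitions
        pipelines.py    # item pipelines
        settings.py     # project settings
        middlewares.py  # downloader and spider middlewares
        spiders/        # the directory where spiders live
            __init__.py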

In the second-level book directory are the configuration files introduced above. Before modifying them, we first create a start.py file in the first-level book directory, which is used to run the scrapy crawler. Write the following code in the file: the first two parameters are fixed, and the third parameter is the name of your spider.
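The file body itself is not reproduced in the source, but a three-parameter call matching the description above is scrapy's cmdline helper; a minimal sketch, assuming the spider is named book, looks like this:

from scrapy import cmdline

# The first two arguments ('scrapy', 'crawl') are fixed;
# the third is the name of the spider to run.
cmdline.execute(['scrapy', 'crawl', 'book'])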

 

Next we fill in the fields in items.py:
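The item definition itself is not shown in the source. Judging from the fields the spider fills in below (item['name'] and item['author']), items.py presumably looks something like this minimal sketch:

import scrapy

class BookItem(scrapy.Item):
    # Fields populated by the spider's parse() method.
    name = scrapy.Field()    # novel title
    author = scrapy.Field()  # novel author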

Then create the crawler's main program, book.py, in the spiders directory.

The website we want to crawl is http://book.km.com/

By clicking the different novel categories on the site, you will find that each category page has a URL of the form http://book.km.com/xxx.html, where xxx is the category name (for example, jushi or xuanhuan, as used in the code below).

With this pattern we can build the requests and read the content of each page.

Once the response is obtained, we use the parse function to parse the webpage and extract the required information.

The page analysis itself, which extracts the data with the BeautifulSoup library, is omitted here. Analyze it yourself 2333~

To store the crawled information, you need to edit pipelines.py.

Two storage methods are provided here (a sketch of both follows the list):

1. Save as a txt text file

2. Store the data in a database
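The original pipeline code is not reproduced in the source, so the following is only a minimal sketch of the two approaches. The class names (TxtPipeline, DBPipeline), the file names (book.txt, book.db), and the choice of sqlite3 are illustrative assumptions; the source does not say which database was actually used.

import sqlite3

class TxtPipeline(object):
    # Illustrative pipeline: append each item to a plain text file.
    def open_spider(self, spider):
        self.file = open('book.txt', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write('{}\t{}\n'.format(item['name'], item['author']))
        return item

    def close_spider(self, spider):
        self.file.close()

class DBPipeline(object):
    # Illustrative pipeline: store each item in a SQLite database.
    def open_spider(self, spider):
        self.conn = sqlite3.connect('book.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS book (name TEXT, author TEXT)')

    def process_item(self, item, spider):
        self.conn.execute('INSERT INTO book (name, author) VALUES (?, ?)',
                          (item['name'], item['author']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()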

To make this run successfully, we also need to configure it in settings.py:

ITEM_PIPELINES = {'book.pipelines.XXX': 300}

Here XXX is the class name of the storage pipeline; change it to whichever class you want to use for storage. The run output is omitted, since there is nothing special to show.
That is it for the first crawler written with a framework. With finals approaching there is no time to keep improving it; later, when time allows, I will extend it into a program that also crawls the novel content, and so on.

The complete code of book.py is provided below:
import scrapy
from bs4 import BeautifulSoup
from book.items import BookItem

class Bookspider(scrapy.Spider):
    name = 'book'  # spider name
    allowed_domains = ['book.km.com']  # allowed domain list
    zurl = 'http://book.km.com/'

    def start_requests(self):
        d = ['jushi', 'xuanhuan']  # novel types; two are listed here, you can add more
        for i in d:  # traverse the types and request each category page
            url = f'{self.zurl}{i}.html'
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        imf = BeautifulSoup(response.text, 'lxml')
        b = imf.find_all('dl', class_='info')
        for i in b:
            bookname = i.a.string
            author = i.dd.span.string
            item = BookItem()
            item['name'] = bookname
            item['author'] = author
            yield item