Python crawler path of a salted fish (5): the scrapy crawler framework

Introduction to scrapy crawler framework

Installation: pip install scrapy. Since I use Anaconda, I installed it with conda install scrapy instead.

The data flow in the scrapy framework works as follows:

1. The Engine obtains a crawling Request from the Spider.
2. The Engine forwards the request to the Scheduler for scheduling.
3. The Engine obtains the next request to crawl from the Scheduler.
4. The Engine sends the request to the Downloader through the Downloader Middleware.
5. After downloading the webpage, the Downloader returns a Response to the Engine through the middleware.
6. The Engine sends the Response to the Spider through the middleware for processing.
7. After processing the Response, the Spider produces scraped Items and new Requests and sends them to the Engine.
8. The Engine sends the scraped Items to the Item Pipeline (the framework exit).
9. The Engine sends the new requests to the Scheduler.

The Engine keeps controlling the data flow between the modules, obtaining crawling requests from the Scheduler until no requests remain.
Framework entry: the Spider's initial crawling requests.
Framework exit: the Item Pipeline.

Each module plays the following role:

Engine
(1) Controls the data flow between all modules
(2) Triggers events based on conditions
No user modification required

Downloader
Downloads webpages as requested
No user modification required

Scheduler
Schedules and manages all crawling requests
No user modification required

Downloader Middleware
Purpose: provides user-configurable control between the Engine, the Scheduler, and the Downloader
Function: modify, discard, or add requests or responses
Users can write configuration code

Spider
(1) Parses the Response returned by the Downloader
(2) Generates scraped Items
(3) Generates additional crawling Requests
You need to write the configuration code

Item Pipelines
(1) Processes the Items generated by the Spider in pipeline fashion
(2) Consists of a sequence of operations, like a pipeline; each operation is an Item Pipeline class
(3) Possible operations include cleaning, validating, and de-duplicating the HTML data in scraped Items, and storing the data in a database
You need to write the configuration code

After learning about the basic concepts, let's start writing the first scrapy crawler.

Create a crawler project: scrapy startproject xxx (where xxx is the project name).

This crawler simply crawls the title and author of a novel website.

We have now created a crawler project named book, and we can edit its configuration.
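For reference, a project created with scrapy startproject book is laid out roughly as follows (the exact files vary slightly between scrapy versions; middlewares.py, for example, only appears in newer ones):

book/
    scrapy.cfg          # deployment configuration
    book/               # the project's Python module (the second-level book directory)
        __init__.py
        items.py        # item definitions
        pipelines.py    # item pipelines
        settings.py     # project settings
        middlewares.py  # downloader and spider middlewares
        spiders/        # the directory where spiders live
            __init__.py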

In the second-level book directory are the configuration files introduced above. Before modifying them, we first create a start.py file in the first-level book directory, which is used to run the scrapy crawler. Write the following code in the file: the first two parameters are fixed, and the third parameter is the name of your spider.
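The file body itself is not reproduced in the source, but a three-parameter call matching the description above is scrapy's cmdline helper; a minimal sketch, assuming the spider is named book, looks like this:

from scrapy import cmdline

# The first two arguments ('scrapy', 'crawl') are fixed;
# the third is the name of the spider to run.
cmdline.execute(['scrapy', 'crawl', 'book'])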

 

Next we fill in the fields in items.py:
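The item definition itself is not shown in the source. Judging from the fields the spider fills in below (item['name'] and item['author']), items.py presumably looks something like this minimal sketch:

import scrapy

class BookItem(scrapy.Item):
    # Fields populated by the spider's parse() method.
    name = scrapy.Field()    # novel title
    author = scrapy.Field()  # novel author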

Then create the crawler's main program, book.py, in the spiders directory.

The website we want to crawl is http://book.km.com/

By clicking the different novel categories on the site, you will find that each category page has a URL of the form http://book.km.com/xxx.html, where xxx is the category name (for example, jushi or xuanhuan, as used in the code below).

With this pattern we can build the requests and read the content of each page.

Once the response is obtained, we use the parse function to parse the webpage and extract the required information.

The page analysis itself, which extracts the data with the BeautifulSoup library, is omitted here. Analyze it yourself 2333~

To store the crawled information, you need to edit pipelines.py.

Two storage methods are provided here (a sketch of both follows the list):

1. Save as a txt text file

2. Store the data in a database
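The original pipeline code is not reproduced in the source, so the following is only a minimal sketch of the two approaches. The class names (TxtPipeline, DBPipeline), the file names (book.txt, book.db), and the choice of sqlite3 are illustrative assumptions; the source does not say which database was actually used.

import sqlite3

class TxtPipeline(object):
    # Illustrative pipeline: append each item to a plain text file.
    def open_spider(self, spider):
        self.file = open('book.txt', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write('{}\t{}\n'.format(item['name'], item['author']))
        return item

    def close_spider(self, spider):
        self.file.close()

class DBPipeline(object):
    # Illustrative pipeline: store each item in a SQLite database.
    def open_spider(self, spider):
        self.conn = sqlite3.connect('book.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS book (name TEXT, author TEXT)')

    def process_item(self, item, spider):
        self.conn.execute('INSERT INTO book (name, author) VALUES (?, ?)',
                          (item['name'], item['author']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()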

To make this run successfully, we also need to configure it in settings.py:

ITEM_PIPELINES = {'book.pipelines.XXX': 300}

Here XXX is the class name of the storage pipeline; change it to whichever class you want to use for storage. The run output is omitted, since there is nothing special to show.
That is it for the first crawler written with a framework. With finals approaching there is no time to keep improving it; later, when time allows, I will extend it into a program that also crawls the novel content, and so on.

The complete code of book.py is provided below:
import scrapy
from bs4 import BeautifulSoup
from book.items import BookItem

class Bookspider(scrapy.Spider):
    name = 'book'  # spider name
    allowed_domains = ['book.km.com']  # allowed domain list
    zurl = 'http://book.km.com/'

    def start_requests(self):
        d = ['jushi', 'xuanhuan']  # novel types; two are listed here, you can add more
        for i in d:  # traverse the types and request each category page
            url = f'{self.zurl}{i}.html'
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        imf = BeautifulSoup(response.text, 'lxml')
        b = imf.find_all('dl', class_='info')
        for i in b:
            bookname = i.a.string
            author = i.dd.span.string
            item = BookItem()
            item['name'] = bookname
            item['author'] = author
            yield item