Introducing the Scrapy crawler framework
Scrapy can be installed with pip install Scrapy. Since I use Anaconda, I installed it with the conda command instead.
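For reference, the two install routes look roughly like this (the conda-forge channel is the one Scrapy's own docs suggest; a plain conda install scrapy may also work depending on your configured channels):

pip install scrapy
conda install -c conda-forge scrapy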
1. The engine gets the initial crawl requests from the Spider.
2. The engine forwards the crawl requests to the Scheduler for scheduling.
3. The engine gets the next request to crawl from the Scheduler.
4. The engine sends the crawl request to the Downloader through the middleware.
5. After crawling the web page, the Downloader forms a Response and returns it to the engine through the middleware.
6. The engine sends the received Response to the Spider for processing through the middleware.
7. After processing the Response, the Spider generates scraped Items and new crawl Requests and sends them to the engine.
8. The engine sends the scraped Items to the Item Pipelines (the framework's exit).
9. The engine sends the crawl Requests to the Scheduler.
The engine controls the data flow between all modules, continuously getting crawl requests from the Scheduler until no requests remain.
Framework entry point: the Spider's initial crawl requests
Framework exit point: the Item Pipelines
Engine
(1) Controls the data flow between all modules
(2) Triggers events based on conditions
No user modification required
Downloader
Downloads web pages on request
No user modification required
Scheduler
Schedules and manages all crawl requests
No user modification required
Downloader Middleware
Purpose: implements user-configurable control between the engine, the Scheduler, and the Downloader
Function: modify, discard, or add Requests or Responses
Users may write configuration code
Spider
(1) Parses the Response returned by the Downloader
(2) Generates scraped Items
(3) Generates additional crawl Requests
Requires user-written configuration code
Item Pipelines
(1) Processes the Items produced by the Spider in a pipelined manner
(2) Consists of a sequence of operations, similar to a pipeline; each operation is an Item Pipeline class
(3) Possible operations include cleaning, validating, and checking for duplicates in the HTML data of scraped Items, and storing the data in a database
Requires user to write configuration code
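In other words, only the Spider and the Item Pipelines normally need user code. As a rough, generic sketch (the class and field names here are placeholders, not part of this tutorial):

import scrapy

class MySpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # framework entry point: the spider's initial crawl requests
        yield scrapy.Request('http://example.com/', callback=self.parse)

    def parse(self, response):
        # parse the Response, yield scraped items and/or further Requests
        yield {'title': response.css('title::text').get()}

class MyPipeline:
    def process_item(self, item, spider):
        # framework exit point: every scraped item passes through the enabled pipelines
        return item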
After understanding the basic concepts, we can start writing our first Scrapy crawler.
First, create a new crawler project: scrapy startproject xxx (where xxx is the project name).
This crawler simply crawls the title and author of a novel website.
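For this tutorial the project is named book (scrapy startproject book). The layout it generates looks roughly like this (middlewares.py is present in recent Scrapy versions):

book/                  first-level directory
    scrapy.cfg
    book/              second-level directory holding the configuration files
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py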
We have now created the crawler project book; next we edit its configuration. The files in the second-level book directory are the configuration files described above. Before modifying them, we first create a start.py in the first-level book directory so that the Scrapy crawler can be run from inside the IDE. Write the following code in that file.
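The start.py code block did not survive in this copy of the post; a minimal sketch of the usual approach, assuming the spider is named book as in the complete code at the end, is:

from scrapy import cmdline

# equivalent to typing "scrapy crawl book" on the command line
cmdline.execute(['scrapy', 'crawl', 'book'])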
The first two parameters are fixed, and the third parameter is the name of your spider.
Next we fill in the fields in items.py:
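The items code is also missing from this copy; given that the spider at the end of the post fills in 'name' and 'author' on a BookItem, items.py presumably looks roughly like this:

import scrapy

class BookItem(scrapy.Item):
    # the two fields this crawler extracts for each novel
    name = scrapy.Field()
    author = scrapy.Field()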
Then create the crawler's main program, book.py, in the spiders directory.
The site we are going to crawl is http://book.km.com/. Clicking through the different novel categories on the site, you will find that each category page's address is http://book.km.com/ plus the pinyin of the category plus .html. Using this pattern we can write the code that requests and reads the pages.
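The request-generating half of book.py (excerpted from the complete code at the end of the post) builds one URL per category pinyin and hands it to Scrapy:

    def start_requests(self):
        d = ['jushi', 'xuanhuan']          # novel category pinyin; add more if needed
        for i in d:
            url = self.zurl + i + '.html'  # e.g. http://book.km.com/jushi.html
            yield scrapy.Request(url, callback=self.parse)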
Once a page is retrieved, the parse function parses it and extracts the required information. The actual extraction is done with the BeautifulSoup library; I won't go into the details here, analyze the page yourself 2333~
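For reference, the parsing half (lightly condensed from the complete code at the end) assumes each novel sits in a dl element with class Info, with the title in its first a tag and the author in a span inside a dd:

    def parse(self, response):
        imf = BeautifulSoup(response.text, 'lxml')
        b = imf.find_all('dl', class_='Info')   # one dl per novel
        for i in b:
            item = BookItem()
            item['name'] = i.a.string           # title from the first a tag
            item['author'] = i.dd.span.string   # author from the span inside dd
            yield item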
Now that the spider is written, we need to store the crawled information, which means editing pipelines.py.
Two storage options are shown here:
1. Save as a TXT text file
2. Save into a database
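The pipeline code did not survive in this copy either. A minimal sketch of the TXT option (the class name and output file name are my own placeholders) could look like the following; a database version would keep the same process_item structure but write to a connection opened in open_spider instead:

class BookTxtPipeline:
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.file = open('book.txt', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # write one "title <tab> author" line per scraped item
        self.file.write('{}\t{}\n'.format(item['name'], item['author']))
        return item

    def close_spider(self, spider):
        self.file.close()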
To make this run we also need to enable the pipeline in settings.py:
ITEM_PIPELINES = {'book.pipelines.xxx': 300}
Here xxx is the class name of the storage method; set it to whichever pipeline class you want to use. I won't paste the run results here.
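With the placeholder TXT pipeline sketched above, for example, the setting would read:

ITEM_PIPELINES = {
    'book.pipelines.BookTxtPipeline': 300,   # lower values run first (conventionally 0-1000)
}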
That's it for this first crawler framework. Once this busy stretch is over I will keep improving the crawler, and when it can also crawl the novels' content I'll share that program too.
The complete code of book.py is attached below:
import scrapy
from bs4 import BeautifulSoup
from book.items import BookItem

class BookSpider(scrapy.Spider):
    name = 'book'                      # spider name
    allowed_domains = ['book.km.com']  # list of domains the spider is allowed to crawl
    zurl = 'http://book.km.com/'

    def start_requests(self):
        d = ['jushi', 'xuanhuan']      # novel categories; two are listed here, add more if you need them
        for i in d:                    # loop over the categories
            url = self.zurl + i + '.html'
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        imf = BeautifulSoup(response.text, 'lxml')
        b = imf.find_all('dl', class_='Info')
        for i in b:
            bookname = i.a.string
            author = i.dd.span.string
            item = BookItem()
            item['name'] = bookname
            item['author'] = author
            yield item
A salted fish's path to Python crawling (5): the Scrapy crawler framework