A Salted Fish's Road to Python Crawling (5): The Scrapy Crawler Framework


An introduction to the Scrapy crawler framework

Scrapy can be installed with pip install Scrapy. Since I use Anaconda, I installed it with conda install scrapy.
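For reference, either command works (the conda-forge channel is optional; the default Anaconda channel also carries Scrapy):

    pip install Scrapy
    conda install scrapy          # under Anaconda; or: conda install -c conda-forge scrapy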

1. The Engine gets the initial crawl requests from the Spider.
2. The Engine forwards the crawl request to the Scheduler for scheduling.
3. The Engine gets the next request to crawl from the Scheduler.
4. The Engine sends the crawl request to the Downloader through the middleware.
5. After downloading the web page, the Downloader returns a Response to the Engine through the middleware.
6. The Engine sends the received Response to the Spider through the middleware for processing.
7. After processing the Response, the Spider produces scraped Items and new crawl Requests and sends them to the Engine.
8. The Engine sends the scraped Items to the Item Pipeline (framework exit).
9. The Engine sends the crawl Requests to the Scheduler.

The Engine controls the data flow between the modules, continuously getting crawl requests from the Scheduler until no requests remain.
Framework entry: the Spider's initial crawl requests
Framework exit: the Item Pipeline

Engine
(1) Controls the data flow between all modules
(2) Triggers events based on conditions
No user modification required

Downloader
Downloads web pages according to requests
No user modification required

Scheduler
Schedules and manages all crawl requests
No user modification required

Downloader Middleware
Purpose: implements user-configurable control between the Engine, the Scheduler, and the Downloader
Function: modify, discard, or add Requests or Responses
Users may write configuration code

Spider
(1) Parses the Response returned by the Downloader
(2) Generates scraped Items
(3) Generates additional crawl Requests
Requires the user to write configuration code

Item Pipelines
(1) Processes the Items generated by Spiders in a pipelined manner
(2) Consists of a sequence of operations, like a pipeline; each operation is an Item Pipeline class
(3) Possible operations include cleaning, validating, and de-duplicating the HTML data in scraped Items, and storing the data in a database
Requires the user to write configuration code

After understanding the basic concepts, we can begin writing the first Scrapy crawler.

Start by creating a new crawler project: scrapy startproject xxx (xxx is the project name).
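For this article the project is named book. Creating it produces roughly the following layout (start.py and spiders/book.py are added by hand in the steps below):

    scrapy startproject book

    book/                    # first-level (top-level) directory
        scrapy.cfg
        start.py             # we add this by hand later, to run the crawler from the IDE
        book/                # second-level directory holding the configuration files
            items.py         # Item field definitions
            pipelines.py     # storage pipelines
            settings.py      # project settings
            spiders/
                book.py      # the crawler main program, added later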

This crawler simply crawls the titles and authors on a novel website.

We have now created the crawler project book; next we edit its configuration.

The second-level book directory contains the configuration files described above. Before modifying them, we first create a start.py in the first-level book directory, so that the Scrapy crawler can be run from inside the IDE. Write the following code inside the file.

The first two parameters are fixed, and the third parameter is the name of your spider.
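The original shows this code as a screenshot; a minimal sketch of what start.py looks like, matching the description above (the spider name book comes from the spider defined later):

    # start.py -- run the crawler from inside the IDE
    from scrapy import cmdline

    # "scrapy" and "crawl" are fixed; "book" is the name of your spider
    cmdline.execute("scrapy crawl book".split())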

Next we fill in the fields in items.py:
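The field definitions appear as a screenshot in the original; reconstructed from the two fields the spider uses later (name and author), items.py looks roughly like this:

    # items.py -- fields for the scraped data
    import scrapy

    class BookItem(scrapy.Item):
        name = scrapy.Field()      # novel title
        author = scrapy.Field()    # novel author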

Then create the crawler main program book.py in the spiders directory.

The site we are going to crawl is http://book.km.com/

By clicking through the different novel categories on the site, you will find that each category's URL is http://book.km.com/ plus the pinyin of the category name plus .html.

Based on this, we can write the code that requests the category pages.
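The original shows this step as a screenshot; the request-generating part of the spider (also included in the full code attached at the end) builds the URLs exactly this way:

    # book.py (excerpt) -- generate one request per novel category
    def start_requests(self):
        d = ['jushi', 'xuanhuan']            # novel categories (pinyin); add more as needed
        for i in d:
            url = self.zurl + i + '.html'    # http://book.km.com/<pinyin>.html
            yield scrapy.Request(url, callback=self.parse)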

After getting the pages, we use the parse function to parse the retrieved web pages and extract the required information.

The page analysis and data extraction are done with the BeautifulSoup library; I will not go into detail here, analyze it yourself. 2333~

With the program written, we need to store the crawled information, so we edit pipelines.py.

Two storage options are available here (a sketch of the TXT option follows this list):

1. Save as a TXT text file

2. Store in a database
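The original shows the pipeline code as screenshots; a minimal sketch of the TXT option, assuming the name and author fields defined above (the class name TxtPipeline is only an example):

    # pipelines.py -- save scraped items to a text file (TXT option)
    class TxtPipeline(object):
        def open_spider(self, spider):
            # open the output file once when the spider starts
            self.file = open('book.txt', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            # one line per novel: title <tab> author
            self.file.write('{}\t{}\n'.format(item['name'], item['author']))
            return item

        def close_spider(self, spider):
            self.file.close()

The database option follows the same pattern, with the connection opened in open_spider, the insert done in process_item, and the connection closed in close_spider.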

To make this run successfully, we also need to configure settings.py:

ITEM_PIPELINES = {'book.pipelines.xxx': 300,}
xxx is the class name of the storage method; replace it with the name of whichever storage class you want to use, then run the crawler.
That's it for this first crawler framework example; it was put together in a hurry. When I have time I will continue to improve the crawler, finish a version that also crawls the novels' content, and share it.
The complete code of book.py is attached:
import scrapy
from bs4 import BeautifulSoup
from book.items import BookItem

class BookSpider(scrapy.Spider):
    name = 'book'                          # spider name
    allowed_domains = ['book.km.com']      # list of domains the spider is allowed to crawl
    zurl = 'http://book.km.com/'

    def start_requests(self):
        d = ['jushi', 'xuanhuan']          # novel categories; two are listed here, add more if needed
        for i in d:                        # loop over the categories
            url = self.zurl + i + '.html'
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        imf = BeautifulSoup(response.text, 'lxml')
        b = imf.find_all('dl', class_='info')
        for i in b:
            bookname = i.a.string
            author = i.dd.span.string
            item = BookItem()
            item['name'] = bookname
            item['author'] = author
            yield item
