The Scrapy framework for Python data collection

Source: Internet
Author: User

Scrapy is a fast screen-scraping and web-crawling framework for crawling websites and extracting structured data from their pages. Scrapy is widely used for data mining, public opinion monitoring, and automated testing.

1. Scrapy Overview

1.1 Scrapy Overall Framework

1.2 Scrapy Components

(1) Engine (Scrapy Engine): processes the data flow of the whole system and triggers events as they occur.
(2) Scheduler: accepts requests from the engine, pushes them onto a queue, and returns them when the engine asks for them again. It decides which URL to download next and removes duplicate URLs.
(3) Downloader: downloads web page content and returns it to the spiders.
(4) Spiders: extract the required information from specific web pages. You use them to define parsing rules for particular pages and to extract specific entities (items) or URL links. Each spider is responsible for one or more specific websites.
(5) Item pipeline (Item Pipeline): responsible for processing the entities the spiders extract from web pages; its main functions are persisting the entities, validating them, and clearing out unwanted information. After a page has been parsed by a spider, its items are sent to the item pipeline and processed through several steps in a specific order (a minimal pipeline sketch follows this list).
(6) Downloader middlewares: a sub-framework between the Scrapy engine and the downloader, primarily handling the requests and responses that pass between them.
(7) Spider middlewares: a framework between the Scrapy engine and the spiders whose main task is to process the spiders' response input and request output.
(8) Scheduler middlewares: middleware between the Scrapy engine and the scheduler, which handles the requests and responses sent from the engine to the scheduler.
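
As an illustration of where custom code plugs into these components, below is a minimal item pipeline sketch. It is not part of the original article's project; the class name and the cleanup rule are made up for illustration, and it assumes items shaped like the {'name': ..., 'price': ...} dictionaries produced later in this article.

# pipelines.py -- minimal, hypothetical item pipeline sketch
from scrapy.exceptions import DropItem

class PriceValidationPipeline(object):
    def process_item(self, item, spider):
        # validity check: drop any item that arrives without a price
        if not item.get('price'):
            raise DropItem('missing price in %r' % item)
        # cleanup: strip the leading currency symbol so only the number is stored
        item['price'] = item['price'].lstrip(u'£')
        return item

A pipeline like this would be enabled by listing it under ITEM_PIPELINES in the project's settings.py, for example {'bookspider.pipelines.PriceValidationPipeline': 300}.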

1.3 Scrapy Run Process

(1) The engine opens a domain, looks up the spider that handles that domain, and asks the spider for the first URL(s) to crawl.
(2) The engine obtains the first URL to crawl from the spider and schedules it in the scheduler as a request.
(3) The engine asks the scheduler for the next URL to crawl.
(4) The scheduler returns the next URL to crawl to the engine, and the engine sends it to the downloader through the downloader middleware.
(5) Once the page has been downloaded, the downloader generates a response for it and sends the response to the engine through the downloader middleware.
(6) The engine receives the response from the downloader and sends it to the spider through the spider middleware.
(7) The spider processes the response and returns the scraped items and any new requests to the engine.
(8) The engine sends the scraped items to the item pipeline and sends the new requests to the scheduler.
(9) The process repeats from step (2) until there are no more requests in the scheduler, at which point the engine closes the connection to that domain.

2. Creating a Simple Scrapy Crawler

The following uses a books website as an example to write a simple Scrapy crawler.

2.1 Creating a Scrapy Project

First we create a Scrapy project: cd into the directory where you want the project to live, then run the scrapy startproject command, as follows:

scrapy startproject bookspider

The Scrapy project directory structure is shown in the following illustration:
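
The original illustration is not reproduced here; for reference, a freshly generated project typically has the layout below (the exact set of files varies slightly between Scrapy versions):

bookspider/
    scrapy.cfg            # deployment configuration file
    bookspider/           # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory that holds the project's spiders
            __init__.py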

2.2 Analyzing the Page

Before writing the spider, we first analyze the page to be crawled. The target site is: http://books.toscrape.com/

(1) Locate the information to crawl:

The information we want to crawl includes the book title, the book price, and the link to the next page. You can easily inspect the page by pressing the F12 key in the Firefox browser. The page information to be crawled is shown in the following illustration:

As you can see from the picture above, each book's information is contained in an <article class="product_pod"> element,
the title is in the title attribute of the h3 > a element under it,
the book price is in the text of its <p class="price_color"> element,
and the URL of the next page is in ul.pager > li.next > a, as shown in the following illustration:
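
Before writing the spider, these selectors can be tried out interactively with the scrapy shell tool; the sketch below is not from the original article, and the comments describe what each expression should return:

scrapy shell 'http://books.toscrape.com/'
>>> book = response.css('article.product_pod')[0]
>>> book.xpath('./h3/a/@title').extract_first()        # title of the first book
>>> book.css('p.price_color::text').extract_first()    # its price text
>>> response.css('ul.pager li.next a::attr(href)').extract_first()  # relative URL of the next page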

2.3 Implementing the Spider

Now we write the crawler program. In the project's spiders directory, create a new file bookspider.py with the following source code:

# -*- coding: utf-8 -*-
"""
Created on Fri June 8 14:26:12 2017
@author: Administrator
"""
import scrapy


class BookSpider(scrapy.Spider):
    # the name attribute is the unique identifier of each spider
    name = "books"
    # define the spider's starting point; there can be more than one start URL
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        """
        Goal: extract the book information.
        Analysis: the information for each book is inside <article class="product_pod">,
        so we use the css() method to find all of these article elements and iterate over them.
        """
        for book in response.css('article.product_pod'):
            # the title is in the title attribute of the article > h3 > a element
            name = book.xpath('./h3/a/@title').extract_first()
            # the price is in the text of <p class="price_color">
            price = book.css('p.price_color::text').extract_first()
            yield {
                'name': name,
                'price': price,
            }

        """
        Goal: extract the next-page link.
        Analysis: the next-page URL is in ul.pager > li.next > a.
        """
        next_page = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_page:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
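
Note (not from the original article): in newer Scrapy releases, .get() and .getall() are the preferred aliases for .extract_first() and .extract(), and response.follow(next_page, callback=self.parse) can replace the urljoin() plus scrapy.Request() pattern; the listing above keeps the calls used in the original article.
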
2.4 Running the Spider

Run the books spider in the newly created Scrapy project directory and export the scraped data to a books.csv file.
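
With the standard Scrapy command line this corresponds to the command below, where the -o option exports the scraped items to the named file:

scrapy crawl books -o books.csv
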
The results of the operation are shown in the following illustration:

Some of the data in the books.csv file is shown in the following figure:
