The Scrapy framework for Python data collection

Source: Internet
Author: User

Scrapy is a fast screen-scraping and web-crawling framework for crawling websites and extracting structured data from their pages. Scrapy is widely used for data mining, public opinion monitoring, and automated testing.

1. Scrapy Overview

1.1 Scrapy Overall Framework

1.2 Scrapy Components

(1) Engine (Scrapy Engine): processes the data flow of the whole system and triggers events as they occur.
(2) Scheduler: accepts requests from the engine, pushes them onto a queue, and returns them when the engine asks for them again. It decides which URL to download next and removes duplicate URLs.
(3) Downloader: downloads web page content and returns it to the spiders.
(4) Spiders: extract the required information from specific web pages. You use them to define parsing rules for particular pages and to extract specific entities (items) or URL links. Each spider is responsible for one or more specific websites.
(5) Item pipeline (Item Pipeline): responsible for processing the entities the spiders extract from web pages; its main functions are persisting the entities, validating them, and clearing out unwanted information. After a page has been parsed by a spider, its items are sent to the item pipeline and processed through several steps in a specific order (a minimal pipeline sketch follows this list).
(6) Downloader middlewares: a sub-framework between the Scrapy engine and the downloader, primarily handling the requests and responses that pass between them.
(7) Spider middlewares: a framework between the Scrapy engine and the spiders whose main task is to process the spiders' response input and request output.
(8) Scheduler middlewares: middleware between the Scrapy engine and the scheduler, which handles the requests and responses sent from the engine to the scheduler.
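
As an illustration of where custom code plugs into these components, below is a minimal item pipeline sketch. It is not part of the original article's project; the class name and the cleanup rule are made up for illustration, and it assumes items shaped like the {'name': ..., 'price': ...} dictionaries produced later in this article.

# pipelines.py -- minimal, hypothetical item pipeline sketch
from scrapy.exceptions import DropItem

class PriceValidationPipeline(object):
    def process_item(self, item, spider):
        # validity check: drop any item that arrives without a price
        if not item.get('price'):
            raise DropItem('missing price in %r' % item)
        # cleanup: strip the leading currency symbol so only the number is stored
        item['price'] = item['price'].lstrip(u'£')
        return item

A pipeline like this would be enabled by listing it under ITEM_PIPELINES in the project's settings.py, for example {'bookspider.pipelines.PriceValidationPipeline': 300}.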

1.3 Scrapy Run Process

(1) The engine opens a domain, looks up the spider that handles that domain, and asks the spider for the first URL(s) to crawl.
(2) The engine obtains the first URL to crawl from the spider and schedules it in the scheduler as a request.
(3) The engine asks the scheduler for the next URL to crawl.
(4) The scheduler returns the next URL to crawl to the engine, and the engine sends it to the downloader through the downloader middleware.
(5) Once the page has been downloaded, the downloader generates a response for it and sends the response to the engine through the downloader middleware.
(6) The engine receives the response from the downloader and sends it to the spider through the spider middleware.
(7) The spider processes the response and returns the scraped items and any new requests to the engine.
(8) The engine sends the scraped items to the item pipeline and sends the new requests to the scheduler.
(9) The process repeats from step (2) until there are no more requests in the scheduler, at which point the engine closes the connection to that domain.

2. Creating a Simple Scrapy Crawler

The following uses a books website as an example to write a simple Scrapy crawler.

2.1 Creating a Scrapy Project

First we create a Scrapy project: cd into the directory where you want the project to live, then run the scrapy startproject command, as follows:

scrapy startproject bookspider

The Scrapy project directory structure is shown in the following illustration:
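
The original illustration is not reproduced here; for reference, a freshly generated project typically has the layout below (the exact set of files varies slightly between Scrapy versions):

bookspider/
    scrapy.cfg            # deployment configuration file
    bookspider/           # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory that holds the project's spiders
            __init__.py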

2.2 Analyzing the Page

Before writing the spider, we first analyze the page to be crawled. The target site is: http://books.toscrape.com/

(1) Locate the information to crawl:

The information we want to crawl includes the book title, the book price, and the link to the next page. You can easily inspect the page by pressing the F12 key in the Firefox browser. The page information to be crawled is shown in the following illustration:

As you can see from the picture above, each book's information is contained in an <article class="product_pod"> element,
the title is in the title attribute of the h3 > a element under it,
the book price is in the text of its <p class="price_color"> element,
and the URL of the next page is in ul.pager > li.next > a, as shown in the following illustration:
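
Before writing the spider, these selectors can be tried out interactively with the scrapy shell tool; the sketch below is not from the original article, and the comments describe what each expression should return:

scrapy shell 'http://books.toscrape.com/'
>>> book = response.css('article.product_pod')[0]
>>> book.xpath('./h3/a/@title').extract_first()        # title of the first book
>>> book.css('p.price_color::text').extract_first()    # its price text
>>> response.css('ul.pager li.next a::attr(href)').extract_first()  # relative URL of the next page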

2.3 Implementing the Spider

Now we write the crawler program. In the project's spiders directory, create a new file bookspider.py with the following source code:

# -*- coding: utf-8 -*-
"""
Created on Fri June 8 14:26:12 2017
@author: Administrator
"""
import scrapy


class BookSpider(scrapy.Spider):
    # the name attribute is the unique identifier of each spider
    name = "books"
    # define the spider's starting point; there can be more than one start URL
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        """
        Goal: extract the book information.
        Analysis: the information for each book is inside <article class="product_pod">,
        so we use the css() method to find all of these article elements and iterate over them.
        """
        for book in response.css('article.product_pod'):
            # the title is in the title attribute of the article > h3 > a element
            name = book.xpath('./h3/a/@title').extract_first()
            # the price is in the text of <p class="price_color">
            price = book.css('p.price_color::text').extract_first()
            yield {
                'name': name,
                'price': price,
            }

        """
        Goal: extract the next-page link.
        Analysis: the next-page URL is in ul.pager > li.next > a.
        """
        next_page = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_page:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
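
Note (not from the original article): in newer Scrapy releases, .get() and .getall() are the preferred aliases for .extract_first() and .extract(), and response.follow(next_page, callback=self.parse) can replace the urljoin() plus scrapy.Request() pattern; the listing above keeps the calls used in the original article.
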
2.4 Running the Spider

Run the books spider in the newly created Scrapy project directory and export the scraped data to a books.csv file.
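
With the standard Scrapy command line this corresponds to the command below, where the -o option exports the scraped items to the named file:

scrapy crawl books -o books.csv
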
The results of the operation are shown in the following illustration:

Some of the data in the books.csv file is shown in the following figure:
