A Preliminary Introduction to the Python Framework Scrapy (I)


Scrapy Introduction

Scrapy is an application framework written to crawl web sites and extract structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.
A web crawler is, informally, a program that crawls data from the web, either broadly or in a directed way; more precisely, it fetches the HTML data of specific web pages. The usual approach is to define a portal (start) page; since a page generally contains the URLs of other pages, those URLs are extracted from the current page and added to the crawler's queue, and the same process is then repeated on each new page. In essence, this is the same as a depth-first or breadth-first traversal.
Scrapy uses Twisted, an asynchronous networking library, to handle network communication. It has a clear architecture and provides a variety of middleware interfaces, so it can flexibly accommodate many kinds of requirements.

Overall architecture

  • Scrapy Engine: handles the data flow of the entire system and triggers events.
  • Scheduler: accepts requests sent by the engine, pushes them into a queue, and returns them when the engine asks for them again.
  • Downloader: downloads web content and returns it to the spiders.
  • Spiders: where the main work happens; each spider defines the parsing rules for a specific domain or set of pages. You write a class that parses responses and extracts items (the data you are after) or additional follow-up URLs. Each spider is responsible for one (or several) specific web sites.
  • Item Pipeline: responsible for processing the items extracted from web pages by the spiders; its main tasks are cleaning, validating, and storing data. After a page is parsed by a spider, its items are sent to the item pipeline and processed by several components in a specific order.
  • Downloader middlewares: hooks between the Scrapy engine and the downloader; they mainly process the requests and responses passed between the engine and the downloader.
  • Spider middlewares: hooks between the Scrapy engine and the spiders; they mainly process the spiders' response input and request output.
  • Scheduler middlewares: middleware between the Scrapy engine and the scheduler; they process the requests and responses sent between the engine and the scheduler.
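Most of these components are wired together through the project settings. As a minimal, hypothetical sketch (the dotted module paths and class names below are placeholders, not part of the original article), enabling custom middlewares and a pipeline in settings.py might look like this:

# settings.py (illustrative sketch; paths and class names are made up)

BOT_NAME = "myproject"

# Downloader middlewares: hooks between the engine and the downloader.
# The number controls the order in which they are applied.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.CustomDownloaderMiddleware": 543,
}

# Spider middlewares: hooks between the engine and the spiders.
SPIDER_MIDDLEWARES = {
    "myproject.middlewares.CustomSpiderMiddleware": 543,
}

# Item pipelines: items returned by spiders pass through these in order.
ITEM_PIPELINES = {
    "myproject.pipelines.CleanAndStorePipeline": 300,
}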
Crawl process

The green line in the architecture diagram is the data flow. Starting from the initial URL, the scheduler hands it to the downloader to download; once downloaded, the page is handed to the spider for analysis. The spider produces two kinds of results: one is links that need further crawling, such as the "next page" links mentioned earlier, which are sent back to the scheduler; the other is the data that needs to be saved, which is delivered to the item pipeline, the place where the data is post-processed (analyzed in detail, filtered, stored, and so on). In addition, various middlewares can be installed along the data flow to perform whatever processing is necessary.

Data flow

The data flow in Scrapy is controlled by the execution engine, with the following process:

1. The engine opens a website (opens a domain), finds the spider that handles that site, and asks the spider for the first URL(s) to crawl.
2. The engine gets the first URL to crawl from the spider and schedules it as a Request with the Scheduler.
3. The engine asks the scheduler for the next URL to crawl.
4. The scheduler returns the next URL to crawl to the engine, and the engine forwards it to the Downloader through the downloader middlewares (request direction).
5. Once the page has been downloaded, the downloader generates a Response for the page and sends it to the engine through the downloader middlewares (response direction).
6. The engine receives the response from the downloader and sends it to the spider for processing through the spider middlewares (input direction).
7. The spider processes the response and returns the crawled items and new (follow-up) requests to the engine.
8. The engine passes the crawled items (returned by the spider) to the Item Pipeline, and the requests (returned by the spider) to the scheduler.
9. The process repeats from step 2 until there are no more requests in the scheduler, at which point the engine closes the site.
Scrapy Project Basic Process

Default Scrapy project structure

Create the project with the global command startproject; this creates a Scrapy project named project_name inside a project_name folder.

scrapy startproject myproject

The Scrapy project defaults to a file structure similar to the following:

scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...

The directory containing scrapy.cfg is considered the root directory of the project. That file contains the name of the Python module that defines the project's settings.
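For reference, a scrapy.cfg generated for the project above typically looks roughly like this (the exact contents, especially the deploy section, vary by Scrapy version):

[settings]
default = myproject.settings

[deploy]
#url = http://localhost:6800/
project = myproject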
Define the data to crawl

Item is a container for the crawled data. It works like a Python dictionary, but additionally protects against errors caused by misspelled or undefined field names.
Similar to what you would do in an ORM, you define an Item by creating a class that subclasses scrapy.Item and declaring class attributes of type scrapy.Field.
First, model the item according to the data we want to obtain from dmoz.org (DMOZ is a well-known open directory, the Open Directory Project, the largest global web directory community, maintained and built by volunteers from around the world). We need to get each site's name, URL, and description from DMOZ, so we define the corresponding fields in the item. Edit the items.py file:

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
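To see the dictionary-like behavior described above, the item can be exercised interactively. The snippet below is a small illustrative sketch, not part of the original tutorial; the import path is hypothetical and should be adjusted to your project name:

from myproject.items import DmozItem  # hypothetical module path

# Create an item and access its fields with standard dict syntax.
item = DmozItem(title="Example site")
item["link"] = "http://example.com/"

print(item["title"])     # 'Example site'
print(item.get("desc"))  # None, because 'desc' has not been set yet

# A misspelled or undefined field name raises a KeyError instead of passing silently:
# item["lnik"] = "..."   # KeyError: 'DmozItem does not support field: lnik'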
To create a spider using the project command genspider

scrapy genspider [-t template] <name> <domain>

Creates a spider in the current project.
This is just a shortcut for creating spiders: it generates a spider from a predefined template. You can also write the spider's source file from scratch yourself.

$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

$ scrapy genspider -d basic
import scrapy

class $classname(scrapy.Spider):
    name = "$name"
    allowed_domains = ["$domain"]
    start_urls = (
        'http://www.$domain/',
    )

    def parse(self, response):
        pass

$ scrapy genspider -t basic example example.com
Created spider 'example' using template 'basic' in module:
  mybot.spiders.example
To write the spider that extracts the item data

Spiders are classes that the user writes to crawl data from a single web site (or a group of web sites).
A spider contains the initial URLs to download, defines how to follow links in the pages, and defines how to analyze page content to extract items.
To create a spider, you must subclass scrapy.Spider and define the following three attributes:

  • name: identifies the spider. The name must be unique; you cannot use the same name for different spiders.
  • start_urls: a list of URLs the spider crawls at startup. The first pages fetched will be among these; subsequent URLs are extracted from the data retrieved from the initial URLs.
  • parse(): a method of the spider. When called, the Response object generated after each initial URL finishes downloading is passed to it as its only argument. The method is responsible for parsing the returned data (the response), extracting data (generating items), and generating Request objects for URLs that need further processing.
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"  # unique identifier; this name is used when starting the spider
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
To crawl

Execute the project command crawl to start the spider:

scrapy crawl dmoz

In this process:
Scrapy creates a scrapy.Request object for each URL in the spider's start_urls attribute and assigns the parse method to each request as its callback function.
These request objects are scheduled and executed, producing scrapy.http.Response objects that are then passed back to the spider's parse() method.
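What Scrapy does implicitly here can also be written out explicitly by overriding start_requests(). The sketch below is illustrative only (the spider name is made up and it is not part of the original tutorial); it shows roughly the default behavior of one Request per start URL, with parse() registered as the callback:

import scrapy

class ExplicitRequestsSpider(scrapy.Spider):
    name = "dmoz_explicit"  # hypothetical name
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def start_requests(self):
        # One Request per start URL; parse() will receive the resulting Response.
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.info("Downloaded %s", response.url)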

Extracting data with selectors

Introduction to selectors:
Scrapy has its own mechanism for extracting data. The components involved are called selectors, because they "select" parts of an HTML document specified by XPath or CSS expressions.
XPath is a language for selecting nodes in XML documents, and it can also be used with HTML. CSS is a language for applying styles to HTML documents; its selectors define the association between styles and particular HTML elements.

Examples and meanings of XPath expressions:

  • /html/head/title: selects the <title> element inside the <head> element of the HTML document
  • /html/head/title/text(): selects the text of the <title> element mentioned above
  • //td: selects all <td> elements
  • //div[@class="mine"]: selects all <div> elements that have the attribute class="mine"
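These expressions can be tried out against a response (for example in the scrapy shell) or against a standalone Selector. The snippet below is a small illustrative sketch using a hard-coded HTML fragment, not part of the original article:

from scrapy.selector import Selector

# A tiny HTML fragment to exercise the XPath expressions listed above.
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <div class="mine">hello</div>
    <table><tr><td>cell</td></tr></table>
  </body>
</html>
"""

sel = Selector(text=html)
print(sel.xpath('/html/head/title').extract())          # the <title> element
print(sel.xpath('/html/head/title/text()').extract())   # ['Example page']
print(sel.xpath('//td').extract())                      # all <td> elements
print(sel.xpath('//div[@class="mine"]').extract())      # <div> elements with class="mine"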

Extract data:
Observe the HTML source code and determine the appropriate XPath expression.
After viewing the source of the web page, you will find that the information about the web sites is contained in the second <ul> element.
We can select all of the <li> elements in the site list with this code:

response.xpath('//ul/li')

Item objects are custom Python dictionaries. You can use standard dictionary syntax to access the value of each of their fields.
In general, a spider returns the crawled data as Item objects. So, to return the crawled data, our final code looks like this:

import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

Crawling dmoz.org now produces DmozItem objects.

Save data

The simplest way to store the crawled data is to use the Feed exports:

scrapy crawl dmoz -o items.json

This command serializes the crawled data as JSON and generates an items.json file.
If you need to perform more complex operations on the crawled items, you can write an Item Pipeline. A tutorial/pipelines.py file for your own pipelines was already created when the project was set up, just like items.py. However, if you only want to save the items, you do not need to implement any pipeline.
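As a rough illustration of what such a pipeline might look like (a minimal sketch, not from the original article; the class name is made up), a pipeline that writes each item as one JSON line could be placed in tutorial/pipelines.py and enabled via the ITEM_PIPELINES setting:

# tutorial/pipelines.py (illustrative sketch)
import json

class JsonWriterPipeline(object):
    """Write each scraped item as one JSON line in items.jl."""

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

# settings.py (illustrative): enable the pipeline with an order number.
# ITEM_PIPELINES = {'tutorial.pipelines.JsonWriterPipeline': 300}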

Supplemental tip: special requirements for installing Scrapy on Windows

On the Windows platform, before installing Scrapy you first need to do the following:

  • Install OpenSSL
    On the Win32 OpenSSL page, download and install the Visual C++ Redistributables and the corresponding OpenSSL installer, then add its executable directory "*\openssl-win32\bin" to the PATH environment variable.
  • Install the binary packages Scrapy depends on:
    pywin32
    Twisted
    zope.interface
    lxml
    pyOpenSSL
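Assuming pip is available, these dependencies (and Scrapy itself) can usually be installed with something along these lines; exact package availability varies by Python version and platform:

pip install pywin32 twisted zope.interface lxml pyopenssl
pip install scrapy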


This article is reproduced from http://www.jianshu.com/p/a8aad3bf4dc4; thanks to the author for sharing.


