The Scrapy Crawler Framework


One: Scrapy Framework Introduction

1 Introduction

Scrapy is an open-source, collaborative framework originally designed for page fetching (more specifically, web scraping); it can extract the required data from websites in a fast, simple, and extensible way. Scrapy is now widely used for data mining, monitoring, and automated testing, and it can also fetch data returned by APIs (such as Amazon Associates Web Services) or act as a general-purpose web crawler. Scrapy is built on Twisted, a popular event-driven Python networking framework, so it uses non-blocking (i.e. asynchronous) code to implement concurrency.

The overall structure is broadly as follows:

The components:

1. Engine: responsible for controlling the flow of data between all components of the system and triggering events when certain actions occur. See the Data Flow section of the Scrapy documentation for details.
2. Scheduler: accepts requests sent by the engine, pushes them into a queue, and returns them when the engine asks again. It can be thought of as a priority queue of URLs; it decides which URL to crawl next and removes duplicate URLs.
3. Downloader: downloads web page content and returns it to the engine. The downloader is built on Twisted's efficient asynchronous model.
4. Spiders: developer-defined classes that parse responses, extract items, or send new requests.
5. Item Pipelines: responsible for processing items after they have been extracted, mainly cleanup, validation, and persistence (such as storing them in a database).
6. Downloader Middlewares: sit between the Scrapy engine and the downloader, handling requests passed from the engine to the downloader and responses passed from the downloader back to the engine. You can use this middleware to: (1) process a request just before it is sent to the downloader (i.e. right before Scrapy sends the request to the website); (2) change a received response before passing it to a spider; (3) send a new request instead of passing the received response to a spider; (4) pass a response to a spider without fetching the web page; (5) silently drop some requests.
7. Spider Middlewares: sit between the engine and the spiders; their main job is to process spider input (responses) and output (requests).


2 Installation

# Windows platform

1. pip3 install wheel  # after this, software can be installed from wheel files; wheel file site: https://www.lfd.uci.edu/~gohlke/pythonlibs
2. pip3 install lxml
3. pip3 install pyopenssl
4. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/
5. Download the Twisted wheel file: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
6. In a terminal (cmd), run pip3 install followed by the path of the downloaded wheel, e.g. <download directory>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
7. pip3 install scrapy

3 Command-Line Tools
# 1 View help
    scrapy -h
    scrapy <command> -h

# 2 There are two kinds of commands: Project-only commands must be run inside a project folder, while Global commands do not require that.
    Global commands:
        startproject  # create a project
        genspider     # create a spider
        settings      # if run inside a project directory, shows that project's settings
        runspider     # run a standalone Python spider file without having to create a project
        shell         # scrapy shell <url>: interactive debugging, e.g. to check whether selector rules are correct
        fetch         # fetch a single page independently of the project; useful to inspect the request headers
        view          # download the page and open it directly in a browser, to identify which data comes from AJAX requests
        version       # scrapy version shows the Scrapy version; scrapy version -v also shows the versions of Scrapy's dependencies
    Project-only commands:
        crawl         # run a spider; a project must have been created, and make sure ROBOTSTXT_OBEY = False in the settings if needed
        check         # check the project for syntax errors
        list          # list the spiders contained in the project
        edit          # open a spider in an editor; rarely used
        parse         # scrapy parse <url> --callback <callback>: useful to verify that a callback function is correct
        bench         # scrapy bench: stress test

# 3 Website link
    https://docs.scrapy.org/en/latest/topics/commands.html
4 Directory Structure
"' project_name/   scrapy.cfg   project_name/       __init__.py       items.py       pipelines.py       settings.py       spiders/           __init__.py           crawler 1.py crawler           2.py           crawler 3.py "

File Description:

    • scrapy.cfg: the project's master configuration, used when deploying Scrapy; crawler-related configuration lives in the settings.py file.
    • items.py: data storage templates for structured data, similar to Django models.
    • pipelines.py: data processing behavior, e.g. general persistence of structured data.
    • settings.py: the configuration file, e.g. recursion depth, concurrency, download delay, and so on. Important: option names in this file must be uppercase or they are considered invalid; the correct form is USER_AGENT = 'xxxx'.
    • spiders: the crawler directory, where you create spider files and write crawling rules.

Attention:

1. Spider files are generally named after the site's domain name.

2. By default, commands can only be executed from the terminal. To make running the crawler more convenient, create an entrypoint.py in the project root:

# entrypoint.py, created in the project root
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'xiaohua'])

Framework basics: the Spider class, selectors, and the other core components covered in the sections below.

Two: The Spider Class

A Spider is a class that defines how to crawl a site (or a group of sites), including how to perform the crawl (i.e. following links) and how to extract structured data (i.e. scraped items) from its pages. In other words, a spider is where you define the custom behavior for crawling and parsing pages for a particular site (or, in some cases, a group of sites).

"' 1, generate the initial requests to crawl the first URL, and identify a callback function     the first request definition in the Start_requests () method obtains the address from the Start_urls list by default to generate the request requests,
The default callback function is the parse method. The callback function automatically fires when the download is complete and returns response 2, in the callback function, resolves response and returns a return value of 4: The dictionary that contains the parsed data item Object The new Request object (the new requests also needs to specify a callback function) or an iterative object (containing items or request) 3, parsing the page content in the callback function usually uses the selectors that the scrapy comes with, But obviously you can also use beutifulsoup,lxml or whatever you like with it. 4. Finally, the items object returned will be persisted to the database through the item pipeline component to the database: https://docs.scrapy.org/en/latest/topics/ Item-pipeline.html#topics-item-pipeline) or export to a different file (via feed exports:https://docs.scrapy.org/en/latest/topics /feed-exports.html#topics-feed-exports) ""
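As an illustration of this cycle, here is a minimal spider sketch. It is not from the original post: the quotes.toscrape.com site and the CSS selectors are assumptions chosen for the example.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # start_requests() turns these URLs into the initial Requests,
    # with parse() as the default callback.
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Parse the response and yield dicts with the extracted data.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Yield a new Request (with a callback) to follow pagination.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)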
Three: Selectors
To explain how to use selectors, we will use the Scrapy shell (which provides interactive testing) and the sample page hosted on the Scrapy documentation server; a small stand-alone sketch follows below.
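The sample page's HTML did not survive in this copy of the post, so here is a self-contained sketch instead; the HTML snippet and the selector expressions are illustrative assumptions, not the documentation's actual sample page.

from scrapy.selector import Selector

# A small HTML snippet standing in for the documentation's sample page
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <div id="images">
      <a href="image1.html">Name: Image 1 <br/><img src="image1_thumb.jpg"/></a>
      <a href="image2.html">Name: Image 2 <br/><img src="image2_thumb.jpg"/></a>
    </div>
  </body>
</html>
"""

sel = Selector(text=html)
print(sel.xpath("//title/text()").get())            # 'Example page'
print(sel.css("a::attr(href)").getall())            # ['image1.html', 'image2.html']
print(sel.xpath("//a/text()").re(r"Name:\s*(.*)"))  # ['Image 1 ', 'Image 2 ']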

Four: Items

The main goal of scraping is to extract structured data from unstructured sources (usually web pages). Scrapy spiders can return the extracted data as Python dicts. Although convenient and familiar, dicts make it easy to mistype a field name or return inconsistent data, especially in larger projects with many spiders.

To define a common output data format, Scrapy provides the Item class. Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.

1 Declaring Items

Items are declared using a simple class-definition syntax and Field objects. Here is an example:

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

Those familiar with Django will notice that Scrapy Items are declared similarly to Django Models, except that Scrapy Items are simpler since there is no concept of different field types.

2 Item Fields

The Field object is used to specify metadata for each field, for example the serializer function for the last_updated field in the example above.

You can specify any kind of metadata for each field. There is no restriction on the values accepted by Field objects. For the same reason, there is no reference list of all available metadata keys.

Each key defined in a Field object can be used by a different component, and only those components know about it. You can also define and use any other Field key in your project for your own needs.

The main goal of Field objects is to provide a way to define all field metadata in one place. Typically, components whose behavior depends on a given field use certain field keys to configure that behavior.

3 Using Items

Here are some examples of common tasks performed with items, using the Product item declared above. You will notice the API is very similar to the dict API; a sketch follows below.
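The original post hid these examples behind an interactive "View Code" widget; here is a minimal sketch of dict-like item usage, assuming the Product item defined earlier (the field values are illustrative).

import scrapy


class Product(scrapy.Item):  # same Item as declared above
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)


# Creating an item
product = Product(name="Desktop PC", price=1000)

# Getting field values with dict-like access
print(product["name"])            # Desktop PC
print(product.get("stock", 0))    # 0 - the 'stock' field is not populated yet

# Setting field values
product["last_updated"] = "today"

# Accessing an undeclared field raises KeyError
# product["lala"] = "test"        # KeyError: 'Product does not support field: lala'

# Converting to a plain dict
print(dict(product))              # {'name': 'Desktop PC', 'price': 1000, 'last_updated': 'today'}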

4 Extending Items

You can extend items by declaring subclasses of the original item (to add more fields or to change some metadata for some fields).

For example:

class DiscountedProduct(Product):
    discount_percent = scrapy.Field(serializer=str)
    discount_expiration_date = scrapy.Field()
Five: Item Pipeline

After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.

Each item pipeline component (sometimes simply called an "item pipeline") is a Python class that implements a simple method. It receives an item and performs an action on it, while also deciding whether the item should continue through the pipeline or be dropped and no longer processed.

Typical uses of item pipelines are:

    • Cleansing HTML Data
    • Validating scraped data (checking that items contain certain fields)
    • Checking for duplicates (and dropping them)
    • Storing the scraped item in a database
1 Writing Your Own Item Pipeline
Each item pipeline component is a Python class that must implement the following method:

process_item(self, item, spider)
    This method is called for every item pipeline component. process_item() must either return a dict with data, return an Item (or any subclass) object, return a Twisted Deferred, or raise a DropItem exception. Dropped items are no longer processed by further pipeline components.

Additionally, a pipeline may implement the following methods:

open_spider(self, spider)
    Called when the spider is opened.

close_spider(self, spider)
    Called when the spider is closed.

from_crawler(cls, crawler)
    If present, this classmethod is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. The Crawler object provides access to all Scrapy core components, such as settings and signals; it is a way for the pipeline to access them and hook its functionality into Scrapy.
2 Item Pipeline Examples

(1) Price validation and dropping items that have no price

Let's take a look at the hypothetical pipeline below, which adjusts the price attribute of items that do not include VAT (those with a price_excludes_vat attribute) and drops items that have no price:
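The original code was hidden behind a "View Code" widget; the sketch below follows the well-known example from the Scrapy documentation (the vat_factor value is illustrative).

from scrapy.exceptions import DropItem


class PricePipeline:
    vat_factor = 1.15  # illustrative VAT multiplier

    def process_item(self, item, spider):
        if item.get("price"):
            if item.get("price_excludes_vat"):
                # add VAT to prices that were scraped without it
                item["price"] = item["price"] * self.vat_factor
            return item
        else:
            # items without a price are dropped and not processed further
            raise DropItem(f"Missing price in {item}")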

(2) Write items to a JSON file

The following pipeline stores all scraped items (from all spiders) in a single items.jl file, with one JSON-serialized item per line:

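The hidden code block is not preserved here; a sketch along the lines of the Scrapy documentation example:

import json


class JsonWriterPipeline:
    def open_spider(self, spider):
        # called when the spider is opened
        self.file = open("items.jl", "w")

    def close_spider(self, spider):
        # called when the spider is closed
        self.file.close()

    def process_item(self, item, spider):
        # one JSON-serialized item per line
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item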

Note that the purpose of JsonWriterPipeline is just to demonstrate how to write item pipelines. If you really want to store all scraped items in a JSON file, you should use Feed Exports instead.

(3) Write items to a database

In this example, we will use pymongo to write items to MongoDB. The MongoDB address and database name are specified in the Scrapy settings, and the MongoDB collection is named after the item class.

The point of this example is to show how to use the from_crawler() method and how to clean up resources properly:
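The original code was hidden behind "View Code"; this sketch follows the pattern in the Scrapy documentation, and the setting names MONGO_URI and MONGO_DATABASE are assumptions for this example.

import pymongo


class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # pull the connection details from the Scrapy settings
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "items"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # the collection is named after the item class, as described above
        self.db[type(item).__name__].insert_one(dict(item))
        return item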

(4) Duplicates filter

A filter that looks for duplicate items and drops those that have already been processed. Suppose our items have a unique id, but our spider returns multiple items with the same id:
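Again, the hidden code is not preserved; a minimal sketch, assuming each item carries an 'id' field:

from scrapy.exceptions import DropItem


class DuplicatesPipeline:
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item["id"] in self.ids_seen:
            # drop items whose id has already been seen
            raise DropItem(f"Duplicate item found: {item!r}")
        self.ids_seen.add(item["id"])
        return item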

3 Activating an Item Pipeline Component

To activate the item pipeline component, you must add its class to the ITEM_PIPELINES settings, as shown in the following example:

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

The integer values you assign to the classes in this setting determine the order in which they run: items pass through the pipelines from the lower-valued classes to the higher-valued ones. It is customary to define these numbers in the 0-1000 range.

Six: Downloader Middleware
class MyDownMiddleware(object):
    def process_request(self, request, spider):
        """
        Called for every request that passes through the downloader middleware,
        right before the request is downloaded.
        :param request:
        :param spider:
        :return:
            None: continue to the next middleware and download the page
            Response object: stop executing process_request and start executing process_response
            Request object: stop executing this middleware chain; the request is rescheduled
            raise IgnoreRequest: stop executing process_request and start executing process_exception
        """
        pass

    def process_response(self, request, response, spider):
        """
        Called with the response returned by the downloader, before it reaches the spider.
        :param request:
        :param response:
        :param spider:
        :return:
            Response object: passed on to the process_response of the other middlewares
            Request object: stop the middleware chain; the request is rescheduled for download
            raise IgnoreRequest: request.errback is called
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when a download handler or process_request() (from a downloader middleware)
        raises an exception.
        :param request:
        :param exception:
        :param spider:
        :return:
            None: continue passing the exception to the next middleware
            Response object: stop the subsequent process_exception methods
            Request object: stop the middleware chain; the request is rescheduled for download
        """
        return None
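To enable a middleware like this one, it has to be registered in the DOWNLOADER_MIDDLEWARES setting; a sketch, where the module path 'myproject.middlewares' is a placeholder:

# settings.py - 'myproject.middlewares' is a placeholder module path
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyDownMiddleware': 543,
}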
Seven: Settings Configuration
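The settings listed in the original post were hidden behind a "View Code" widget and are not preserved; a sketch of commonly tuned options, with purely illustrative values, might look like this:

# settings.py - values are illustrative
BOT_NAME = 'project_name'

SPIDER_MODULES = ['project_name.spiders']
NEWSPIDER_MODULE = 'project_name.spiders'

# Whether to obey robots.txt rules
ROBOTSTXT_OBEY = False

# Concurrency and politeness
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1

# Crawl depth limit (0 means unlimited)
DEPTH_LIMIT = 3

# Option names must be uppercase, e.g. USER_AGENT rather than user_agent
USER_AGENT = 'Mozilla/5.0'

# Enabled item pipelines and their order
ITEM_PIPELINES = {
    'project_name.pipelines.ProjectNamePipeline': 300,
}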

Eight: Project Code

Download the project code.

