A Brief Introduction to Scrapy, the Python Open-Source Crawling Framework, and Solutions to Common Installation Problems (Ubuntu)

Some content is adapted from: http://www.kuqin.com/system-analysis/20110906/264417.html

I. Overview

The general architecture of Scrapy comprises its main components and the data flow between them (indicated by green arrows in the original architecture diagram, not reproduced here). The following sections describe the function of each component and the data processing flow.

II. Components

1. Scrapy Engine

The Scrapy engine controls the data flow of the entire system and triggers events as transactions occur. See the data processing flow below for details.

2. Scheduler

The scheduler accepts requests from the Scrapy engine, sorts them into a queue, and returns them to the engine when the engine asks for them.

3. Downloader

The downloader's main responsibility is to fetch web pages and return their content to the spiders.

4. Spiders

Spiders are classes defined by Scrapy users to parse web pages and extract content from the responses returned for their URLs. Each spider can handle one domain or a group of domains; in other words, a spider defines the crawling and parsing rules for a specific website.

The full crawl cycle of a spider works as follows (a minimal spider illustrating the cycle is sketched after this list):

  1. The cycle starts with the initial requests for the first URLs, together with a callback function to be invoked with their responses. These first requests are created by the start_requests() method, which by default generates a request for each URL in start_urls and uses the spider's parse() method as the callback.
  2. In the callback you parse the response and return an iterable of item objects, request objects, or both. The returned requests also carry a callback; Scrapy downloads them and then hands the responses to that callback.
  3. In the callbacks you typically extract the page content with Scrapy's XPath selectors (although you can also use BeautifulSoup, lxml, or any other tool you prefer) and generate items from the parsed data.
  4. Finally, the items returned from the spider are usually handed to the item pipeline.
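
Below is a minimal sketch of such a spider, written against a recent Scrapy release (where scrapy.Spider, response.xpath() and .get() are available); the domain, XPath expressions, and field names are invented for illustration:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # Hypothetical spider: the domain, URLs, and XPath expressions are placeholders.
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/articles"]  # start_requests() builds the first requests from these

    def parse(self, response):
        # Step 3 of the cycle: parse the response with the built-in XPath selectors.
        for article in response.xpath("//div[@class='article']"):
            yield {
                "title": article.xpath(".//h2/text()").get(),
                "url": response.urljoin(article.xpath(".//a/@href").get()),
            }

        # Step 2 of the cycle: new requests can be yielded alongside items;
        # Scrapy downloads them and calls the callback given here.
        next_page = response.xpath("//a[@rel='next']/@href").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```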

5. Item Pipeline

The item pipeline's main responsibility is to process the items extracted from web pages by the spiders; its main tasks are cleaning, validating, and storing the data. After a page has been parsed by a spider, the resulting items are sent to the item pipeline and processed by its components in a specific order. Each item pipeline component is a Python class with a simple interface: it receives an item, acts on it, and decides whether the item should continue to the next stage of the pipeline or be dropped without further processing.

The item pipeline typically performs the following steps (a minimal pipeline class is sketched after this list):

  1. Clean HTML data
  2. Validate the parsed data (check that items contain the required fields)
  3. Check for duplicates (drop items that have already been seen)
  4. Store the parsed data in a database
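
A minimal sketch of a pipeline component covering steps 2 and 3 above; the "title" field and the de-duplication rule are assumptions made for illustration:

```python
from scrapy.exceptions import DropItem


class ValidateAndDedupePipeline:
    # Hypothetical pipeline component: assumes items are dicts carrying a "title" field.

    def __init__(self):
        self.seen_titles = set()

    def process_item(self, item, spider):
        # Validate: drop items that are missing the required field.
        if not item.get("title"):
            raise DropItem("missing title in %r" % item)

        # De-duplicate: drop items whose title has already been seen.
        if item["title"] in self.seen_titles:
            raise DropItem("duplicate item: %s" % item["title"])
        self.seen_titles.add(item["title"])

        # Storage (database insert, file write, ...) would happen here.
        return item
```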

6. Downloader Middleware

Downloader middleware is a hook framework that sits between the Scrapy engine and the downloader: it processes the requests passing from the engine to the downloader and the responses coming back. It is a lightweight, low-level system for plugging in custom code to extend Scrapy, giving global control over requests and responses.
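
As a hedged sketch of what such a hook can look like, the class below implements the standard process_request and process_response methods of a downloader middleware; the header name and the logging rule are made up for illustration:

```python
class CustomHeadersMiddleware:
    # Hypothetical downloader middleware: tags outgoing requests and logs non-200 responses.

    def process_request(self, request, spider):
        # Called for every request on its way from the engine to the downloader.
        request.headers.setdefault("X-Crawled-By", spider.name)
        return None  # returning None lets the request continue to the downloader

    def process_response(self, request, response, spider):
        # Called for every response on its way back to the engine.
        if response.status != 200:
            spider.logger.debug("Got %s for %s", response.status, request.url)
        return response
```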

7. Spider Middleware

Spider middleware is a hook framework that sits between the Scrapy engine and the spiders: it processes the responses going into the spiders and the requests and items coming out of them. It provides a way to plug in custom code to extend Scrapy, acting on what is sent to a spider and on the items and requests the spider returns.
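
A minimal sketch, assuming the standard spider-middleware hooks process_spider_input and process_spider_output; the filtering rule and the "title" field are invented for illustration:

```python
class FilterShortTitlesMiddleware:
    # Hypothetical spider middleware: drops items whose "title" looks too short to be useful.

    def process_spider_input(self, response, spider):
        # Called for each response before it reaches the spider's callback.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with everything the callback yielded (items and requests).
        for element in result:
            if isinstance(element, dict) and len(element.get("title") or "") < 3:
                continue  # drop suspiciously short items
            yield element
```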

8. Scheduler Middleware

Scheduler middleware sits between the Scrapy engine and the scheduler. It processes the requests and responses passed between them and provides another place to plug in custom code to extend Scrapy.
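
Custom pipelines and middlewares like the ones sketched above are enabled through the project settings, where the number gives each component's ordering priority; the module and class paths below are placeholders:

```python
# settings.py (excerpt) -- module and class names are hypothetical placeholders.
ITEM_PIPELINES = {
    "myproject.pipelines.ValidateAndDedupePipeline": 300,
}

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.CustomHeadersMiddleware": 543,
}

SPIDER_MIDDLEWARES = {
    "myproject.middlewares.FilterShortTitlesMiddleware": 543,
}
```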

III. Data Processing Flow

The entire data flow in Scrapy is controlled by the Scrapy engine. Its main mode of operation is as follows:

  1. The engine opens a domain, locates the spider that handles it, and asks that spider for the first URLs to crawl.
  2. The engine obtains the first URLs to crawl from the spider and schedules them as requests in the scheduler.
  3. The engine asks the scheduler for the next URLs to crawl.
  4. The scheduler returns the next URLs to the engine, and the engine sends them to the downloader through the downloader middleware.
  5. Once a page has been downloaded, the downloader sends the response back to the engine through the downloader middleware.
  6. The engine receives the response from the downloader and sends it to the spider through the spider middleware for processing.
  7. The spider processes the response and returns the crawled items and any new requests to the engine.
  8. The engine sends the crawled items to the item pipeline and the new requests to the scheduler.
  9. The process repeats from step 2 until there are no more requests in the scheduler, at which point the engine closes the connection to that domain.

IV. Driver (Twisted)

Scrapy is written on top of Twisted, a popular event-driven Python networking framework, and therefore uses non-blocking (asynchronous) processing. For more information about asynchronous programming and Twisted, see the Twisted project documentation.
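
To give a flavour of that callback-based, non-blocking style, here is a tiny example in plain Twisted (not Scrapy code); it schedules a function on the reactor and handles the result in a callback instead of blocking:

```python
# Minimal Twisted example: the reactor schedules work and fires callbacks
# instead of blocking; Scrapy's engine is built on this model.
from twisted.internet import reactor
from twisted.internet.task import deferLater


def report(result):
    print(result)
    reactor.stop()


d = deferLater(reactor, 1.0, lambda: "fired after one second without blocking the reactor")
d.addCallback(report)
reactor.run()
```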

V. Installation Errors and Solutions

NOTE: Trying to build without Cython, pre-generated 'src/lxml.etree.c' needs to be available.
ERROR: /bin/sh: xslt-config: not found

** make sure the development packages of libxml2 and libxslt are installed **

Using build configuration of libxslt
src/lxml.etree.c:4: fatal error: Python.h: No such file or directory
compilation terminated.
error: Setup script exited with error: command 'gcc' failed with exit status 1

Solution:

sudo apt-get install gcc
sudo apt-get install python-dev
sudo apt-get install libxml2 libxml2-dev
sudo apt-get install libxslt1.1 libxslt1-dev

Then run the installation command again (with either easy_install or pip):

sudo easy_install -U Scrapy
sudo pip install Scrapy
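
If the commands above complete, the installation can be sanity-checked from Python; the exact version numbers printed will depend on what was installed:

```python
# Quick check that Scrapy and lxml import cleanly after installation.
import lxml.etree
import scrapy

print(scrapy.__version__)
print(lxml.etree.LXML_VERSION)
```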

VI. References

1. Official documentation: http://scrapy.org/doc/

2. Sample code: http://snipplr.com/all/tags/scrapy/
