Python 3 Distributed Crawler

Source: Internet
Author: User
Tags: xpath, python, script, redis, server

Background

Our department's business (Oriental IC and Tuchong, the image community) requires collecting a large number of image resources for data analysis and for protecting the rights of licensed images. At first we mainly used Node.js for crawling (the business was relatively simple and we were more familiar with Node). As business requirements changed, large-scale crawling ran into various problems. Python has inherent advantages for crawling: the community resources are quite complete and the various frameworks offer excellent support, so crawler performance has also improved greatly. This share starts from basic knowledge, covers two Python frameworks, Pyspider and Scrapy, and then introduces a distributed crawler built on Scrapy and Scrapy-redis (pasted directly from the PPT). It also touches on Redis, MongoDB and other related knowledge.

Anti-hotlinking ("anti-theft chain") countermeasures (automatic login, automatic registration and other common strategies), proxies, crawler snapshots, and storing object resources in TOS are not covered in much detail. Over the last two months we have been working on a visual crawler platform that is configurable and process-oriented.

I. Preface 1.1 What Is a Crawler?

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically crawls information from the World Wide Web according to certain rules.

1.2 Why Python?

Easy to learn: simple enough that even people who have never learned a programming language can write a crawler after reading a little material

Interpreted language: code can be executed directly once written, with no compilation step

High code reuse: a module containing a function can be imported directly into other programs and used

Cross-platform: almost all Python programs run unmodified on different operating systems

II. Basic Knowledge 2.1 The robots Protocol

The robots protocol, also known as the crawler protocol or robot protocol, is formally called the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be crawled and which may not. It usually takes the form of a text file named robots.txt placed in the root directory of a website.

When a search crawler visits a site, it first checks whether a robots.txt file exists in the site's root directory. If it does, the crawler crawls within the scope defined there; if the file is not found, the crawler visits every page it can access directly.
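As a small illustration (not from the original slides), Python's standard library can read and query robots.txt directly; the site below is only a placeholder:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical target site used only for illustration.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

# Ask whether a given user agent may fetch a given URL.
print(rp.can_fetch("*", "https://example.com/some/page.html"))
```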

2.2 The Meaning of a URL

Concept:

A URL, the Uniform Resource Locator (protocol (service mode) + host address, including the port, + specific path), is what we usually call a web address. It is a concise representation of the location of a resource available on the Internet and of the way to access it, and it is the standard form of address for Internet resources. Every file on the Internet has a unique URL, which contains information indicating where the file is and how the browser should handle it. A crawler must have a target URL before it can fetch data, so the URL is the foundation on which a crawler obtains data.
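A quick sketch of how the parts of a URL map onto the description above, using the standard library (the URL itself is a made-up example):

```python
from urllib.parse import urlparse

# Protocol (scheme) + host:port + specific path, plus query string.
parts = urlparse("https://example.com:8080/images/cat.jpg?size=large")
print(parts.scheme)   # 'https'            -> the protocol / service mode
print(parts.netloc)   # 'example.com:8080' -> host and port
print(parts.path)     # '/images/cat.jpg'  -> the specific address
print(parts.query)    # 'size=large'
```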

Related:

URI: Uniform Resource Identifier

URL: Uniform Resource Locator

URN: Uniform Resource Name


2.3 The process of browsing the web

While browsing the web, users see many attractive images, for example on http://image.baidu.com/, where we see a few pictures and the Baidu search box. What actually happens is: the user enters a URL, DNS resolves it to find the server host, the browser sends a request to that server, the server parses the request and returns HTML, JS, CSS and other files to the user's browser, and the browser parses them so that the user can see all kinds of images. In other words, it is the process of one HTTP request.

2.4 Proxy Fundamentals

2.4.1 Basic Principle

A proxy is a bridge between the local machine and the web server. The local machine no longer sends requests directly to the web server; instead it sends them to the proxy server, which forwards them on. In this process the real IP seen by the web server is no longer our local IP, so the IP is successfully disguised. This is the basic principle of a proxy.

2.4.2 What Proxies Are For

1. Break through your own IP access restrictions and visit sites that are normally unreachable

2. Access the internal resources of certain organizations or groups

3. Hide the real IP

2.4.3 Proxies for Crawlers

During crawling we may hit the problem of one IP visiting a site too frequently: the site may demand a verification code, force a login, or block the IP outright, which makes crawling very inconvenient. With a proxy, the server mistakenly believes the proxy server itself is making the request. By constantly switching proxies while crawling we avoid being blocked and achieve a good crawl result.

2.4.4 Proxy Types

FTP proxy server, primarily for access to FTP servers

HTTP proxy Server, primarily for accessing web pages

SSL/TLS proxy, primarily for access to encrypted Web sites

2.4.5 Common proxy settings

Free proxies

Paid proxies
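A minimal sketch of configuring a proxy with the requests library; the proxy address is a placeholder you would replace with a real free or paid proxy:

```python
import requests

# Placeholder proxy address; substitute a real HTTP/HTTPS proxy here.
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}

# The target site now sees the proxy's IP instead of ours.
resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
print(resp.text)
```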


III. Introduction to Crawlers


3.1 Common Crawler Libraries

Request libraries: requests, selenium (automated testing tool) + chromedriver (the Chrome driver), PhantomJS (a headless browser)

Parsing libraries: lxml (HTML, XML, XPath), BeautifulSoup (HTML, XML), pyquery (supports CSS selectors), tesserocr (optical character recognition, for verification codes)

Databases: MongoDB, MySQL, Redis

Storage libraries: PyMySQL, PyMongo, redis-py, redis-dump (a Redis data import/export tool)

Web libraries: Flask (a lightweight web service framework), Django

Other tools: Charles (a packet-capture tool)

3.2 A Basic Example

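The original entry-level example is shown only as slide screenshots; a comparable minimal sketch with requests and lxml (the target URL is a placeholder) might look like this:

```python
import requests
from lxml import etree

# Fetch a page and extract image URLs with XPath; the URL is a placeholder.
url = "https://example.com/gallery"
resp = requests.get(url, timeout=10)
resp.raise_for_status()

html = etree.HTML(resp.text)           # parse the HTML into an element tree
img_urls = html.xpath("//img/@src")    # select the src attribute of every <img>
for src in img_urls:
    print(src)
```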

3.3 A Slightly More Complex Example

Problem: hotlink protection (the "anti-theft chain")

With hotlink protection, the server checks whether the Referer in the request headers points back to its own site. If it does not, some servers simply refuse to respond, so we can add a Referer and other fields to the headers.

Countering the "anti-theft chain"

1. Fully simulate how a browser works

2. Construct cookie information

3. Set header information

4. Configure a proxy

Other strategies

Timeout settings (a combined sketch of these measures follows below)
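A hedged sketch that combines the measures above with requests (all header values, the cookie, the proxy address and the URL are placeholders):

```python
import requests

url = "https://example.com/protected/image.jpg"  # placeholder target

headers = {
    # Pretend to be a normal browser and to come from the site itself.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://example.com/",
}
cookies = {"sessionid": "placeholder-session-value"}   # constructed cookie info
proxies = {"http": "http://127.0.0.1:8888",            # placeholder proxy
           "https": "http://127.0.0.1:8888"}

resp = requests.get(url, headers=headers, cookies=cookies,
                    proxies=proxies, timeout=(5, 30))  # connect/read timeouts
print(resp.status_code)
```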

3.4 Fetching Dynamically Rendered Pages

Splash is a JavaScript rendering service: a lightweight browser with an HTTP API, implemented in Python on top of Twisted and QT. We can also use it to crawl dynamically rendered pages. It can:

Process multiple page-rendering requests asynchronously

Return the source code of the rendered page or a screenshot

Speed up page rendering by turning off image loading or by using Adblock rules

Execute specific JavaScript scripts

Control the page rendering process through Lua scripts

Obtain a detailed record of the rendering process in HAR (HTTP Archive) format
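Assuming a Splash instance is running locally (for example via its Docker image, listening on port 8050), its render.html endpoint can be called like any HTTP service; the page URL below is a placeholder:

```python
import requests

# Ask Splash to render the page (executing its JavaScript) and return the HTML.
splash = "http://localhost:8050/render.html"    # local Splash instance (assumption)
params = {
    "url": "https://example.com/dynamic-page",  # placeholder page to render
    "wait": 2,                                  # seconds to wait for JS to finish
}
resp = requests.get(splash, params=params, timeout=60)
print(resp.text[:500])   # first part of the rendered HTML source
```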

3.5 The Complete Crawler Workflow

(Figure: complete crawler workflow, omitted)

IV. Crawler Frameworks 4.1 Pyspider Introduction

Pyspider is a powerful web crawler system written by a Chinese developer, with a powerful WebUI. It is written in Python, has a distributed architecture, supports multiple database backends, and its WebUI includes a script editor, task monitor, project manager and result viewer.


4.2 Pyspider Features

1. Controlled by Python scripts; you can use any HTML parsing package you like (pyquery is built in)

2. Write and debug scripts, start and stop them, monitor execution status, view activity history and inspect result output, all from the web interface

3. Data storage supports MySQL, MongoDB, Redis, SQLite, Elasticsearch, and PostgreSQL via SQLAlchemy

4. Message queue services support RabbitMQ, Beanstalk, Redis and Kombu

5. Supports crawling JavaScript pages

6. Components are replaceable; supports single-machine and distributed deployment, and Docker deployment

7. Powerful scheduling control; supports timeout re-crawling and priority settings

8. Supports Python 2 and 3
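For reference, a minimal sketch close to pyspider's default project template (the start URL is a placeholder):

```python
from pyspider.libs.base_handler import BaseHandler, every, config


class Handler(BaseHandler):
    crawl_config = {}  # common options (headers, proxy, ...) for every request

    @every(minutes=24 * 60)            # re-run the entry point once a day
    def on_start(self):
        self.crawl("https://example.com/", callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)     # index pages considered fresh for 10 days
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # The returned dict is written to the configured result backend.
        return {"url": response.url, "title": response.doc("title").text()}
```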


4.3 Scrapy Introduction

Scrapy is an application framework written to crawl website data and extract structured data. It can be used in a range of programs including data mining, information processing and storing historical data.


4.4 Scrapy Run Process

1. The scheduler (Scheduler) takes a link (URL) from the queue of links to download

2. The scheduler starts the collection module, the spider (Spiders)

3. The spider hands the URL to the downloader (Downloader), which downloads the resource

4. The target data is extracted into items, which are passed to the item pipeline for further processing, for example saving to a database or to text files

5. If what is parsed out is a link (URL), that URL is inserted into the queue of URLs to be crawled

V. The Scrapy Framework 5.1 Scrapy Basic Usage

Create a project: scrapy startproject tutorial

Create a spider: scrapy genspider quotes quotes.toscrape.com

Run a crawl: scrapy crawl dmoz

Interactive debugging: scrapy shell quotes.toscrape.com

Save data (multiple formats): scrapy crawl quotes -o quotes.json
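A minimal sketch of the kind of spider these commands create and run, targeting the public quotes.toscrape.com practice site:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract each quote block with CSS selectors.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link, if any.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```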


5.2 Scrapy Global Commands

startproject: create a project

genspider: create a spider

settings: get Scrapy's settings

runspider: run a spider contained in a Python file without creating a project

shell: launch the Scrapy shell for the given URL (or an empty shell if no URL is given)

fetch: download the given URL using the Scrapy downloader and write the fetched content to standard output

view: open the given URL in the browser, presented the way the Scrapy spider would see it

version: print the Scrapy version

5.3 Scrapy Project Commands

crawl: start crawling with a spider

check: check a project for errors

list: list all available spiders in the current project, one spider per line

edit: just a convenience shortcut; developers are free to use any other tool or IDE to write and debug spiders

parse: fetch the given URL and parse it with the appropriate spider

bench: run a quick benchmark

5.4 Scrapy Selector

BeautifulSoup is a very popular web parsing library among programmers. It builds a Python object from the structure of the HTML code and handles bad markup quite reasonably, but it has one drawback: it is slow.

lxml is a Python XML parsing library (it also parses HTML) based on the ElementTree API (lxml itself is not part of the Python standard library).

Scrapy has its own mechanism for extracting data. Its components are called selectors because they "select" parts of an HTML document specified by XPath or CSS expressions.

XPath is a language for selecting nodes in XML documents; it can also be used with HTML.

CSS is a language for styling HTML documents; its selectors associate styles with particular HTML elements.

Scrapy selectors are built on top of the lxml library, which means they are very similar to it in speed and parsing accuracy.
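A short sketch of both selector styles against an inline HTML snippet:

```python
from scrapy.selector import Selector

html = """
<html><body>
  <div class="item"><a href="/a">First</a></div>
  <div class="item"><a href="/b">Second</a></div>
</body></html>
"""

sel = Selector(text=html)
# XPath: select the text of every link inside a div with class "item".
print(sel.xpath('//div[@class="item"]/a/text()').getall())   # ['First', 'Second']
# CSS: select the href attribute of the same links.
print(sel.css("div.item a::attr(href)").getall())            # ['/a', '/b']
```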


5.5 Spiders

The Spider class defines how to crawl one or more websites, including the crawling actions (for example, whether to follow links) and how to extract structured data (items) from page content. In other words, the spider is where you define the crawling behaviour and the parsing of pages.

Requests are initialized with the start URLs and given callback functions. When a request has been downloaded, a Response is generated and passed to the callback as an argument.

The spider's initial requests are obtained by calling start_requests(), which reads the URLs in start_urls and generates a Request for each, with parse() as the callback.

Inside the callback, the returned (web) content is parsed, and the callback returns Item objects, dicts, Requests, or an iterable containing any of these. Returned Request objects are then handled by Scrapy: their content is downloaded and the configured callback (which may be the same function) is invoked.

Inside the callback you can use selectors (or BeautifulSoup, lxml, or any parser you prefer) to parse the page content and build items from the parsed data.

Finally, the items returned by the spider are stored in a database (handled by some Item Pipeline) or written to a file using Feed exports.

Attributes

name: a string defining the name of the spider

allowed_domains: a list of the domains the spider is allowed to crawl

start_urls: a list of URLs; the spider starts crawling from this list when no particular URLs are specified

custom_settings: a dict of settings that overrides the project-wide settings when this spider runs; because settings must be updated before instantiation, it has to be defined as a class attribute

crawler: set by the from_crawler() class method after the class is initialized; it links to the Crawler object this spider instance is bound to

settings: the crawler's configuration manager; extensions and middlewares use it to access the Scrapy configuration

logger: e.g. self.logger.info('Log: %s', response.status)

Methods

from_crawler: if present, this class method is called to create the instance from a Crawler and must return a new instance. The Crawler object provides access to all Scrapy core components, such as settings and signals; it is a way for the spider to access them and hook its functionality into Scrapy

start_requests: must return an iterable containing the first requests the spider will use to crawl

make_requests_from_url: takes a URL and returns a Request object for crawling

parse: the default method Scrapy uses to process a downloaded response when the Request specifies no callback

log: record (log) messages through the scrapy.log.msg() method

closed: called when the spider is closed
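A hedged sketch that ties several of these attributes and methods together (the domain and URLs are placeholders):

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    custom_settings = {"DOWNLOAD_DELAY": 1}   # overrides the project settings

    def start_requests(self):
        # Replaces the default start_urls behaviour.
        urls = ["https://example.com/page/1", "https://example.com/page/2"]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.info("Log: %s", response.status)
        yield {"url": response.url,
               "title": response.xpath("//title/text()").get()}

    def closed(self, reason):
        self.logger.info("Spider closed: %s", reason)
```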

5.6 Item Pipeline

After an item is collected in a spider, it is passed to the Item Pipeline, where several components process it in a defined order. Each item pipeline component (sometimes simply called an "item pipeline") is a Python class that implements a few simple methods. It receives an item, performs some action on it, and also decides whether the item continues through the pipeline or is dropped and processed no further.

Here are some typical applications for item pipeline:

1. Cleaning up HTML data

2. Validating crawled data (checking that items contain certain fields)

3. Checking for duplicates (and dropping them)

4. Saving crawl results to a database

Methods

process_item(self, item, spider): called for every item by each pipeline component; it must return a dict with data, an Item (or any subclass) object, or raise a DropItem exception. Dropped items are not processed by later pipeline components.

open_spider: called when the spider is opened

close_spider: called when the spider is closed

from_crawler: used to obtain settings and other configuration

Example: persisting data to MongoDB
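The original example is a screenshot; a comparable sketch of a MongoDB pipeline (the MONGO_URI / MONGO_DATABASE setting names follow the Scrapy documentation's example, and the defaults and collection name here are arbitrary):

```python
import pymongo


class MongoPipeline:
    collection_name = "items"   # arbitrary collection name for this sketch

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read connection info from the project settings.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "scrapy_demo"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item
```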


5.7 Downloader Middleware

The downloader middleware is a framework of hooks into Scrapy's request/response processing: a lightweight, low-level system for globally modifying Scrapy's requests and responses.

Activation

To activate a downloader middleware component, add it to the DOWNLOADER_MIDDLEWARES setting, which is a dict whose keys are the middleware class paths and whose values are the middleware orders.

Methods

1. process_request(request, spider): called for each request that passes through the downloader middleware

2. process_response(request, response, spider): must return one of the following: a Response object, a Request object, or raise an IgnoreRequest exception. If it returns a Response (which can be the same as the incoming response or a brand-new object), that response continues to be processed by the process_response() methods of the other middleware in the chain.

3. process_exception(request, exception, spider): Scrapy calls process_exception() when a download handler or process_request() (from a downloader middleware) raises an exception (including an IgnoreRequest exception)

Example: adding a proxy
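The original "adding a proxy" example is also a screenshot; a hedged sketch of such a middleware (the proxy address and project path are placeholders):

```python
class ProxyMiddleware:
    """Downloader middleware that routes every request through one proxy."""

    def process_request(self, request, spider):
        # Placeholder proxy; in practice this might be picked from a pool.
        request.meta["proxy"] = "http://127.0.0.1:8888"

# settings.py (the number controls the middleware order):
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.ProxyMiddleware": 543,
# }
```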


VI. Building a Scrapy Project 6.1 Crawler Approach

(Figure: crawler approach, omitted)

6.2 Actual Project Analysis

(Figure: start page, omitted)

(Figure: project structure, omitted)

VII. Distributed Crawlers 7.1 Single-Host Crawler Architecture

The local machine maintains a crawl queue, and the scheduler does the scheduling.

Q: What is the key to collaboration between multiple hosts?

A: A shared crawl queue

(Figure: single-host crawler architecture, omitted)

7.2 Distributed Crawler Architecture

(Figure: distributed crawler architecture, omitted)

7.3 Questions

Q1: Which queue should we choose?

A1: A Redis queue

Redis is a non-relational database that stores data as key-value pairs, with flexible structures

It is an in-memory data structure store, so processing is fast and performance is good

It provides several storage structures, such as queues (lists) and sets, which makes maintaining the queue easy

Q2: How do we deduplicate?

A2: Redis sets

Redis provides a set data structure; we store the fingerprint of every request in a Redis set

When storing a request fingerprint in the Redis set, check whether the fingerprint already exists:

[if it exists]: do not add the request to the queue

[if it does not exist]: add the request to the queue and add the fingerprint to the set
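A rough sketch of this check using redis-py directly (the key name and the fingerprint function are simplified placeholders; scrapy-redis has its own request fingerprinting):

```python
import hashlib
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)

def seen_before(url, key="demo:dupefilter"):
    # SADD returns 1 if the member was newly added, 0 if it already existed.
    fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
    added = r.sadd(key, fingerprint)
    return added == 0

print(seen_before("https://example.com/page/1"))  # False: first time, enqueue it
print(seen_before("https://example.com/page/1"))  # True: duplicate, discard
```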

Q3: How do we avoid losing progress when the crawl is interrupted?

A3: Check at startup

When Scrapy starts on each slave, it first checks whether the current Redis request queue is empty

[not empty]: take the next request from the queue and continue crawling

[empty]: start the crawl from the beginning; the first slave to start crawling adds requests to the queue

Q4: How is this architecture implemented?

A4: scrapy-redis

Scrapy is a very useful and very powerful Python crawler framework, but when we have to crawl a very large number of pages, the processing capacity of a single host can no longer meet our needs (in terms of both processing speed and the number of concurrent network requests). This is where the advantages of a distributed crawler appear: many hands make light work. Scrapy-redis combines Scrapy with the distributed data store Redis and rewrites some of Scrapy's key components (the scheduler, the queue and so on), turning Scrapy into a distributed crawler that can run concurrently on multiple hosts.

GitHub address: https://github.com/rmax/scrapy-redis
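With scrapy-redis installed, switching a project over is mostly a matter of settings; the values below follow the scrapy-redis documentation, and the Redis address is a placeholder:

```python
# settings.py additions for scrapy-redis

# Use the scrapy-redis scheduler and duplicate filter instead of Scrapy's own.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the Redis queues between runs, so an interrupted crawl can resume.
SCHEDULER_PERSIST = True

# Optionally push every scraped item into Redis for later processing.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Where the shared queue and fingerprint set live (placeholder address).
REDIS_URL = "redis://localhost:6379"
```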

7.4 Source Code Interpretation

Before reading the source: you need to understand how Scrapy works, otherwise reading it is of little use.

The body of the scrapy-redis project builds on the redis and scrapy libraries, combining the core functionality of the two to implement distributed crawling.


1 connection.py

Responsible for instantiating Redis connections according to the settings. It is called by the dupefilter and the scheduler; in short, anything that needs to access Redis goes through this module.


2 dupefilter.py

Implements Redis-based request deduplication by inheriting from BaseDupeFilter and overriding its methods. Scrapy-redis inserts fingerprints into a Redis set (each spider uses a different key).

The key is the spider name plus a dupefilter suffix, so that crawler instances on different hosts, as long as they belong to the same spider, access the same set, which is their shared URL deduplication pool.

The dupefilter is used by the scheduler class: every request must pass the duplicate check before entering the scheduler; if it is a duplicate, it takes no part in scheduling and is simply discarded.


3 picklecompat.py

It implements two functions, loads and dumps, i.e. a serializer. Because the Redis database cannot store complex objects (values can only be strings, lists of strings, sets of strings, and hashes whose keys are strings), objects must be serialized into text before they can be stored.

This serializer is mainly used for the scheduler's Request objects.

Why not use the JSON format? (The item pipeline serializes to JSON by default.)
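Presumably because a Request object carries things JSON cannot represent (callback references, arbitrary meta values), while pickle can; the module is roughly a thin wrapper around pickle, along these lines:

```python
"""A pickle wrapper used to serialize Request objects for Redis."""
import pickle


def loads(s):
    return pickle.loads(s)


def dumps(obj):
    # protocol=-1 selects the highest available pickle protocol.
    return pickle.dumps(obj, protocol=-1)
```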


4 pipelines.py

This file implements an item pipeline class which, like Scrapy's own item pipelines, takes the key from the configured REDIS_ITEMS_KEY setting and serializes each item into a value in the Redis database (the value is a list, and each of our items is a node in that list). This pipeline saves the extracted items, mainly so that the data can be processed later at our convenience.


5 queue.py

Three queue implementations are provided here:

SpiderQueue (queue): first in, first out (FIFO)

SpiderStack (stack): first in, last out (LIFO)

SpiderPriorityQueue (priority queue)

These container classes are the containers from which the scheduler schedules requests. A scheduler is instantiated on every host and corresponds one-to-one with a spider, so at distributed runtime there are multiple instances of one spider, and multiple schedulers, on different hosts. But because the queues all live on the same Redis server and are read and written with the spider name plus "queue" as the key, crawler instances on different hosts share a single request scheduling pool, which achieves unified scheduling across the distributed crawlers.
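A rough sketch of how the FIFO variant maps onto Redis list operations (simplified: the real implementation also serializes the Request with the serializer above and namespaces the key by spider name):

```python
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)
key = "demo_spider:requests"   # spider name + ":requests" style key (simplified)

def push(serialized_request):
    # New requests go onto the left of the list...
    r.lpush(key, serialized_request)

def pop():
    # ...and are taken from the right: first in, first out.
    return r.rpop(key)

push(b"request-1")
push(b"request-2")
print(pop())   # b'request-1'
```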


6 scheduler.py

This rewrites the scheduler class that replaces Scrapy's original scrapy.core.scheduler. The original scheduling logic is not changed much; Redis is used as the storage medium in order to achieve unified scheduling across the crawlers. The scheduler is responsible for scheduling each spider's requests. When the scheduler is initialized, it reads the queue and dupefilter types from the settings file and configures the keys they use. Whenever a request is scheduled, enqueue_request is called: the scheduler uses the dupefilter to check whether the URL is a duplicate and, if it is not, adds it to the queue container (configurable in settings). When the next request is needed, next_request is called: the scheduler takes a request out through the queue container's interface and hands it to the spider so the spider can do its crawling work.


7 spider.py

The spider is designed to read the URLs to crawl from Redis and then crawl them. If more URLs are produced during the crawl, it keeps going until all requests are finished, then reads URLs from Redis again, and the whole process repeats.

Analysis: the spider's status is monitored by connecting to the signals.spider_idle signal. When the spider is idle, new requests created with make_requests_from_url(url) are returned to the engine and then handed to the scheduler.

Reprinted from: https://www.jianshu.com/p/ec3dfaec3c9b?utm_source=tuicool&utm_medium=referral
