Background
The department (Oriental IC, Tuchong) is business-driven: we need to collect a large number of image resources for data analysis and for licensing genuine images. At first we mainly used Node to write the crawlers (the business was relatively simple, and we were more familiar with Node). As business requirements changed, large-scale crawling ran into all kinds of problems. Python crawlers have a natural advantage here: the community resources are quite complete, the various frameworks provide mature support, and crawler performance has improved greatly. This sharing starts from the basics, covers the two Python frameworks pyspider and Scrapy, and introduces a distributed crawler built on Scrapy and scrapy-redis (the PPT is pasted in directly). Related knowledge such as Redis and MongoDB is also involved.
Anti-hotlinking ("anti-theft chain") and the usual counter-strategies (automatic login, automatic registration, and so on), proxies, crawler snapshots, and storing object resources in TOS are not covered in much depth. Over the past two months we have been building a visual crawler platform that is configurable and workflow-based.
I. Preface
1.1 What is a crawler?
A web crawler (also known as a web spider or web robot, and in the FOAF community often called a web chaser) is a program or script that automatically crawls information from the World Wide Web according to certain rules.
1.2 Why Python?
Easy to learn: simple enough that someone who has never learned a programming language can write a crawler after reading a little material. Interpreted language: code can be run directly after it is written, without compiling
High code reuse: a module containing a function can be imported directly into other programs and used there
Cross-platform: Almost all Python programs can be run without modification on different operating systems
II. Basic Knowledge
2.1 The robots Protocol
The robots protocol, also known as the crawler protocol or robot protocol, is formally called the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be crawled and which may not. It is usually a text file named robots.txt placed in the root directory of a website.
When a search crawler visits a site, it first checks whether a robots.txt file exists in the site's root directory. If it does, the crawler crawls within the range defined there; if the file is not found, the crawler visits every page it can reach.
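As a quick illustration of how a crawler can honor this file, here is a minimal sketch using the standard library's urllib.robotparser; the site URL and user-agent name are placeholders, not from the original slides.

```python
# Minimal sketch: check robots.txt before crawling (site and UA are placeholders).
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# True if our (hypothetical) crawler is allowed to fetch this path
print(rp.can_fetch("MyCrawler", "https://example.com/images/"))
```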
2.2 The Meaning of a URL
Concept:
A URL (protocol (service mode) + host IP address (including port number) + specific resource path), that is, a Uniform Resource Locator, is what we usually call a web address. It is a concise representation of the location of a resource on the Internet and of how to access it, and it is the standard form of an Internet address. Every file on the Internet has a unique URL, which contains information about where the file is and how the browser should handle it. A crawler must have a target URL before it can fetch data, so URLs are the foundation of data collection.
Related:
URI: Uniform Resource Identifier
URL: Uniform Resource Locator
URN: Uniform Resource Name
Image
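To make the protocol + host + specific-address breakdown concrete, here is a small sketch using the standard library's urllib.parse (the example URL is arbitrary):

```python
# Break a URL into the parts described above (example URL is arbitrary).
from urllib.parse import urlparse

parts = urlparse("http://image.baidu.com:80/search/index?tn=baiduimage")
print(parts.scheme)  # 'http'               -> protocol (service mode)
print(parts.netloc)  # 'image.baidu.com:80' -> host and port
print(parts.path)    # '/search/index'      -> specific address
print(parts.query)   # 'tn=baiduimage'      -> query string
```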
2.3 The process of browsing the web
While browsing the web, users see many nice images, for example on http://image.baidu.com/, where a few pictures appear alongside the Baidu search box. What actually happens is this: the user enters a URL; DNS servers resolve it and locate the server host; a request is sent to that server; the server parses the request and returns HTML, JS, CSS, and other files to the user's browser; the browser parses them, and the user sees the images. In other words, browsing is essentially an HTTP request.
2.4 Proxy Fundamentals
2.4.1 Fundamentals
A proxy is a bridge between the local machine and the server. The machine no longer sends requests directly to the web server; instead it sends them to the proxy server, which forwards them. The IP the web server sees is then no longer our local IP, so the IP is successfully disguised. That is the basic principle of a proxy.
2.4.2 What proxies are used for
1. Break through our own IP access restrictions and visit sites that are normally inaccessible
2. Access internal resources of certain organizations or groups
3. Hide the real IP
2.4.3 Proxies for crawlers
During crawling, the same IP may access a site too frequently; the site may then demand a captcha, require a login, or block the IP outright, which makes crawling very inconvenient. With a proxy, the server mistakenly believes that the proxy server made the request itself. By switching proxies continually during the crawl we avoid being blocked and achieve good crawl results.
2.4.4 Proxy types
FTP proxy server: mainly used for accessing FTP servers
HTTP proxy server: mainly used for accessing web pages
SSL/TLS proxy: mainly used for accessing encrypted (HTTPS) websites
2.4.5 Common proxy settings
Free proxies
Paid proxies
Image
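As a sketch of how a proxy (free or paid) is actually plugged into a request with the requests library; the proxy address below is a placeholder, not a working proxy.

```python
# Sketch: route a request through a proxy (the address is a placeholder).
import requests

proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}

resp = requests.get("http://httpbin.org/ip", proxies=proxies, timeout=10)
print(resp.text)  # shows the IP address the target server saw
```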
III. Introduction to Crawlers
Image
3.1 Common crawler libraries
Request libraries: requests, Selenium (automated testing tool) + ChromeDriver (Chrome driver), PhantomJS (headless browser)
Parsing libraries: lxml (HTML, XML, XPath), BeautifulSoup (HTML, XML), pyquery (CSS selectors), tesserocr (optical character recognition, for captchas)
Databases: MongoDB, MySQL, Redis
Storage libraries: PyMySQL, PyMongo, redis-py, redis-dump (a Redis data import/export tool)
Web libraries: Flask (lightweight web framework), Django
Other tools: Charles (packet capture tool)
3.2 A starter example
Image
Image
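The starter code itself only exists in the slide images above; a comparable minimal sketch, fetching a page with requests and pulling image URLs out with a regular expression (the URL and pattern are illustrative assumptions), might look like this:

```python
# Minimal starter crawler sketch: fetch a page and download its images.
import re
import requests

html = requests.get("https://example.com/gallery", timeout=10).text
img_urls = re.findall(r'<img[^>]+src="(http[^"]+)"', html)

for i, url in enumerate(img_urls):
    data = requests.get(url, timeout=10).content
    with open(f"img_{i}.jpg", "wb") as f:  # save each image locally
        f.write(data)
```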
3.3 A slightly more complex example
Problem: hotlink protection ("anti-theft chain")
With hotlink protection, the server checks whether the Referer in the request headers belongs to its own site; if not, some servers will not respond. So we can also add a Referer and other fields to the headers.
Countering hotlink protection
1. Fully simulate how a browser works
2. Construct cookie information
3. Set header information
4. Configure a proxy
Other strategies
Timeout setting
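A sketch that combines these strategies in a single requests call; every value (User-Agent, Referer, cookie, proxy address, URL) is a placeholder rather than code from the original slides:

```python
# Sketch: headers (Referer/User-Agent), cookies, proxy and timeout together.
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://example.com/",       # counters hotlink protection
}
cookies = {"sessionid": "xxxx"}               # constructed cookie information
proxies = {"http": "http://127.0.0.1:8888"}   # proxy setting (placeholder)

resp = requests.get(
    "https://example.com/protected.jpg",
    headers=headers,
    cookies=cookies,
    proxies=proxies,
    timeout=5,                                # timeout setting
)
print(resp.status_code)
```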
3.4 Dynamically rendered page fetching
Splash is a JavaScript rendering service: a lightweight browser with an HTTP API, implemented in Python on top of the Twisted and Qt libraries. We can also use it to scrape dynamically rendered pages. Its main features:
Process multiple page-rendering requests asynchronously
Get the source code or a screenshot of the rendered page
Speed up rendering by turning off image loading or by using Adblock rules
Execute custom JavaScript scripts
Control the page rendering process with Lua scripts
Get detailed rendering information in HAR (HTTP Archive) format
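A usage sketch, assuming a Splash instance is running locally on its default port 8050: the render.html endpoint returns the page's HTML after JavaScript has executed.

```python
# Sketch: fetch a JS-rendered page through a local Splash instance.
import requests

params = {"url": "https://example.com", "wait": 2}  # wait 2s for JS to run
html = requests.get("http://localhost:8050/render.html",
                    params=params, timeout=30).text
print(html[:200])
```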
3.5 The Complete Crawler Process
Image
IV. Crawler Frameworks
4.1 Pyspider Introduction
pyspider is a powerful web crawler system written by a Chinese developer, with a powerful WebUI. It is written in Python, has a distributed architecture, supports multiple database backends, and its WebUI includes a script editor, task monitor, project manager, and result viewer.
Image
4.2 Pyspider Features
1. Controlled by Python scripts; you can use any HTML parsing package you like (pyquery is built in)
2. Write and debug scripts, start and stop them, monitor execution, view crawl history, and inspect results in the web interface
3. Data storage supports MySQL, MongoDB, Redis, SQLite, Elasticsearch, and PostgreSQL (via SQLAlchemy)
4. Message queues support RabbitMQ, Beanstalk, Redis, and Kombu
5. Supports crawling JavaScript pages
6. Components are replaceable; supports single-machine and distributed deployment, as well as Docker deployment
7. Powerful scheduling control, supporting re-crawling after timeouts and priority settings
8. Supports Python 2 and 3
Image
Image
Image
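The screenshots above show the WebUI; the script being edited there is essentially pyspider's default project template, reproduced here as a sketch (the start URL is a placeholder):

```python
# Sketch of a pyspider script (close to the default project template).
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # follow every absolute link on the page
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {"url": response.url, "title": response.doc('title').text()}
```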
4.3 Scrapy Introduction
Scrapy is an application framework written for crawling website data and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and historical data archiving.
Image
4.4 Scrapy Run Process
1. The Scheduler takes a link (URL) from the queue of URLs to be downloaded
2. The scheduler starts the acquisition module, the Spiders module
3. The spider passes the URL to the Downloader, which downloads the resource
4. Extracted target data is wrapped in an Item and handed to the Item Pipeline for further processing, such as saving to a database or to a file
5. If what is extracted is a link (URL), the URL is inserted back into the queue waiting to be crawled
V. The Scrapy Framework
5.1 Scrapy Basic Usage
Create a project: scrapy startproject tutorial
Create a spider: scrapy genspider quotes quotes.toscrape.com
Run the project: scrapy crawl quotes
Interactive debugging: scrapy shell quotes.toscrape.com
Save data (multiple formats): scrapy crawl quotes -o quotes.json
Image
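A minimal spider matching the commands above (close to the official Scrapy tutorial, not the original slide code):

```python
# spiders/quotes.py - run with: scrapy crawl quotes -o quotes.json
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # follow pagination until there is no "next" link
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```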
5.2 Scrapy Global Commands
startproject: create a new project
genspider: create a new spider
settings: get Scrapy settings values
runspider: run a spider contained in a Python file without creating a project
shell: launch the Scrapy shell for the given URL (or with no URL)
fetch: download the given URL using the Scrapy downloader and write the fetched content to standard output
view: open the given URL in the browser, presented as the Scrapy spider would see it
version: print the Scrapy version
5.3 Scrapy Project Commands
crawl: start crawling with a spider
check: check the project for errors
list: list all available spiders in the current project, one spider per line
edit: edit a spider; this is only a convenience shortcut, and developers are free to use other tools or IDEs to write and debug spiders
parse: fetch the given URL and parse it with the appropriate spider
bench: run a quick benchmark
5.4 Scrapy Selectors
BeautifulSoup is a very popular HTML parsing library among Python programmers. It builds Python objects from the structure of the HTML code and handles bad markup very gracefully, but it has one drawback: it is slow.
lxml is a Python XML parsing library (which also parses HTML) based on ElementTree (not part of the Python standard library).
Scrapy has its own mechanism for extracting data. Its extractors are called selectors because they "select" parts of an HTML document specified by XPath or CSS expressions.
XPath is a language for selecting nodes in XML documents, and it can also be used with HTML.
CSS is a language for styling HTML documents. Its selectors associate styles with particular HTML elements.
Scrapy selectors are built on top of the lxml library, which means they are very similar in speed and parsing accuracy.
Image
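A short illustration of the same extraction done with XPath and with a CSS selector, using Scrapy's Selector directly on made-up markup:

```python
# Sketch: XPath vs CSS selection on the same (made-up) HTML fragment.
from scrapy.selector import Selector

sel = Selector(text='<div><a href="/img/1.jpg" class="pic">first</a></div>')

print(sel.xpath('//a/@href').get())   # '/img/1.jpg'
print(sel.css('a.pic::text').get())   # 'first'
```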
5.5 Spiders
The Spider class defines how to crawl a site (or a group of sites), including how to follow links and how to extract structured data (items) from page content. In other words, the spider is where you define the crawling behavior and the logic for parsing a page (or pages).
The spider initializes Requests with the initial URLs and sets a callback function. When a request has been downloaded and returned, a Response is generated and passed as a parameter to that callback.
The initial requests are obtained by calling start_requests(), which reads the URLs in start_urls and generates a Request for each, with parse as the callback.
Inside the callback, the returned (web) content is parsed, yielding Item objects, dicts, Requests, or an iterable containing any of these. Returned Request objects are then processed by Scrapy: their content is downloaded and the configured callback (which may be the same function) is invoked.
Within the callback you can use selectors (or BeautifulSoup, lxml, or any parser you prefer) to parse the page content and generate items from the parsed data.
Finally, the items returned by the spider are stored in a database (handled by some Item Pipeline) or written to a file using Feed exports.
Attributes
name: a string that defines the name of the spider
allowed_domains: a list of the domains the spider is allowed to crawl
start_urls: a list of URLs; the spider starts crawling from this list when no specific URLs are given
custom_settings: a dict of settings that override the project-level settings when the spider starts; because settings must be updated before instantiation, it must be defined as a class attribute
crawler: set by the from_crawler() class method after the class is initialized, and links to the Crawler object this spider instance is bound to
settings: the crawler's configuration manager; extensions and middlewares use it to access the Scrapy configuration
logger: e.g. self.logger.info('Log: %s', response.status)
Methods
from_crawler: if present, this class method is called to create the spider instance from a Crawler; it must return a new instance. The Crawler object provides access to all Scrapy core components, such as settings and signals; it is the way for the spider to access them and hook its functionality into Scrapy
start_requests: must return an iterable containing the first Requests the spider will crawl
make_requests_from_url: takes a URL and returns a Request object for crawling
parse: the default method Scrapy uses to process a downloaded response when the Request specifies no callback
log: log messages (via the scrapy.log.msg() method)
closed: called when the spider is closed
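A sketch that puts several of the attributes and methods above into one spider (the name, domain, and URLs are illustrative):

```python
# Sketch: a spider exercising the attributes and methods listed above.
import scrapy


class ImageSpider(scrapy.Spider):
    name = "images"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/list"]
    custom_settings = {"DOWNLOAD_DELAY": 1}   # overrides project settings

    def start_requests(self):
        # first requests the spider will crawl
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.info("Log: %s", response.status)
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

    def closed(self, reason):
        self.logger.info("spider closed: %s", reason)
```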
5.6 Item Pipeline
After an item has been collected by a Spider, it is passed to the Item Pipeline, where several components process it in a defined order. Each Item Pipeline component (sometimes simply called an "item pipeline") is a Python class that implements a few simple methods. It receives the item, performs some action on it, and decides whether the item continues through the pipeline or is dropped and no longer processed.
Here are some typical applications for item pipeline:
1. Clean up HTML data
2. Validate crawled data (check that items contain certain fields)
3. Check for duplicates (and discard them)
4. Save crawl results to a database
Methods
process_item(self, item, spider): every Item Pipeline component must implement this method. It must return a dict of data or an Item (or any subclass) object, or raise a DropItem exception. Dropped items are not processed by subsequent pipeline components.
open_spider: called when the spider is opened
close_spider: called when the spider is closed
from_crawler: used to obtain settings and configuration information
Example: persisting data to MongoDB
Image
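The original pipeline code is only in the slide image; a sketch following the MongoDB pipeline pattern from the Scrapy documentation (MONGO_URI and MONGO_DATABASE are settings you would add yourself):

```python
# Sketch: persist items to MongoDB from an Item Pipeline.
import pymongo


class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read connection details from settings.py
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "items"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[spider.name].insert_one(dict(item))
        return item
```

The pipeline still has to be enabled through ITEM_PIPELINES in settings.py before Scrapy will call it.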
5.7 Downloader Middleware
The downloader middleware is a framework of hooks into Scrapy's request/response processing: a lightweight, low-level system for globally modifying Scrapy's requests and responses.
Activation
To activate a downloader middleware component, add it to the DOWNLOADER_MIDDLEWARES setting. This setting is a dict whose keys are the middleware class paths and whose values are the middleware orders.
Methods
1. process_request(request, spider): called for each request as it passes through the downloader middleware
2. process_response(request, response, spider): process_request() must return one of the following: None, a Response object, a Request object, or raise an IgnoreRequest exception. If it returns a Response (which can be the same as the incoming response or a completely new object), that response is processed by the process_response() methods of the other middlewares in the chain
3. process_exception(request, exception, spider): Scrapy calls process_exception() when a download handler or a process_request() (from a downloader middleware) raises an exception (including an IgnoreRequest exception)
Example: adding a proxy
Image
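Again the slide image holds the original code; a sketch of a downloader middleware that attaches a proxy to every request (the proxy address is a placeholder):

```python
# Sketch: set a proxy on every outgoing request.
class ProxyMiddleware:
    def process_request(self, request, spider):
        request.meta["proxy"] = "http://127.0.0.1:8888"  # placeholder proxy
        return None  # let the request continue through the chain
```

It would then be registered in DOWNLOADER_MIDDLEWARES with an appropriate order value.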
VI. Building a Scrapy Project
6.1 Crawler Approach
Image
6.2 Actual Project Analysis
Image
Start Page
Image
Project structure
Image
VII. Distributed Crawlers
7.1 Single-host crawler architecture
The local machine maintains a crawler queue, and the scheduler does the scheduling.
Q: What is the key to multi-host collaboration?
A: A shared crawl queue
Image
Single-host crawler architecture
Image
7.2 Distributed Crawler Architecture
Image
Distributed crawler Architecture
Image
7.3 Questions
Q1: How to choose the queue?
A1: A Redis queue
Redis is a non-relational database that stores key-value pairs with flexible structures
It is an in-memory data-structure store, so processing is fast and performance is good
It provides queues, sets, and other storage structures, which makes maintaining a crawl queue easy
Q2: How to deduplicate?
A2: A Redis set
Redis provides a set data structure; store the fingerprint of every request in a Redis set
When adding a request fingerprint to the Redis set, check whether the fingerprint already exists
[if it exists]: the request is not added to the queue
[if it does not exist]: the request is added to the queue and the fingerprint is added to the set
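A redis-py sketch of this idea: sadd() returns 1 when the fingerprint is new and 0 when it is already in the set (the key name and the fingerprinting scheme are illustrative).

```python
# Sketch: request deduplication with a Redis set.
import hashlib
import redis

r = redis.StrictRedis(host="localhost", port=6379)

def should_schedule(request_url):
    fp = hashlib.sha1(request_url.encode("utf-8")).hexdigest()
    # sadd returns 1 only when the fingerprint was not in the set yet
    return r.sadd("myspider:dupefilter", fp) == 1
```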
Q3: How to prevent interruptions?
A3: Check at startup
When each slave's Scrapy process starts, it first checks whether the Redis request queue is empty
[not empty]: take the next request from the queue and crawl it
[empty]: start the crawl over, with the first slave to run adding requests to the queue
Q4: How is this architecture implemented?
A4: scrapy-redis
Scrapy is a very useful and very powerful Python crawler framework, but when the number of pages we have to crawl is very large, a single host's processing power cannot meet our needs, whether in processing speed or in the number of concurrent network requests. That is where a distributed crawler shows its advantage: many hands make light work. scrapy-redis combines Scrapy with the distributed database Redis, rewriting some of Scrapy's key components (the scheduler, the queue, and so on) and turning Scrapy into a distributed crawler that can run concurrently on multiple hosts.
GitHub address: https://github.com/rmax/scrapy-redis
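The typical settings.py changes for switching a Scrapy project over to scrapy-redis look roughly like this (following the project's README; the Redis URL is an assumption about the local setup):

```python
# settings.py additions for scrapy-redis (sketch).
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # shared scheduler
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # shared dedup
SCHEDULER_PERSIST = True            # keep the queue so crawls can resume
REDIS_URL = "redis://localhost:6379"

ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,  # optionally store items in Redis
}
```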
7.4 Source Code Interpretation
Before reading the source: you need to understand how Scrapy works, otherwise reading it is of little use.
The scrapy-redis project is mainly built on the redis and Scrapy libraries, combining the core functionality of the two to implement distribution.
Image
1 connection.py
Responsible for instantiating Redis connections according to the settings. It is called by the dupefilter and the scheduler; in short, any module that needs to access Redis uses it.
Image
2 dupefilter.py
Implements Redis-based request deduplication by inheriting from BaseDupeFilter and overriding its methods. scrapy-redis inserts fingerprints into a Redis set (a different key for each spider).
The key is spider name + dupefilter, so that crawler instances on different hosts, as long as they belong to the same spider, access the same set, which serves as their shared URL deduplication pool.
Deduplication is used in the scheduler class: every request must pass the duplicate check before entering the scheduler; duplicates do not take part in scheduling and are simply dropped.
Image
3 picklecompat.py
Provides two functions, loads and dumps, which together implement a serializer. Because the Redis database cannot store complex objects (values can only be strings, lists of strings, sets of strings, and hashes, and keys can only be strings), objects must be serialized to text before being stored.
This serializer is mainly used for the scheduler's Request objects.
Why not use JSON? (the item pipeline's serialization does default to JSON)
Image
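The serializer amounts to a thin wrapper around pickle, roughly like this sketch (not the verbatim source):

```python
# Sketch of the picklecompat-style serializer: pickle in, pickle out.
import pickle


def loads(s):
    return pickle.loads(s)


def dumps(obj):
    return pickle.dumps(obj, protocol=-1)  # highest available protocol
```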
4 pipelines.py
This file implements an Item Pipeline class which, like Scrapy's own item pipelines, takes the key configured in our settings (REDIS_ITEMS_KEY) and serializes each item into a value in the Redis database (the value is a list, and each item is a node in that list). The pipeline saves the extracted items, mainly so that the data can be processed later at leisure.
Image
5 queue.py
Three implementations of the queue are provided here:
SpiderQueue (queue): first in, first out (FIFO)
SpiderStack (stack): first in, last out (LIFO)
SpiderPriorityQueue (priority queue)
These container classes are used as the containers from which the scheduler dispatches requests. A scheduler is instantiated on each host and corresponds one-to-one with a spider, so at runtime there are multiple instances of the same spider and multiple schedulers on different hosts. But because these containers all connect to the same Redis server and use spider name + queue as the key for reads and writes, crawler instances on different hosts share a single request scheduling pool, which achieves unified scheduling across the distributed crawler.
Image
6 scheduler.py
Rewrites the scheduler class to replace Scrapy's original scrapy.core.scheduler. The original scheduling logic is largely unchanged; the main difference is that Redis is used as the storage medium, so that all crawlers can be scheduled in a unified way. The scheduler is responsible for scheduling each spider's requests. On initialization it reads the queue and dupefilter types from the settings file and configures the keys they use. Whenever a request is scheduled, enqueue_request is called: the scheduler uses the dupefilter to determine whether the URL is a duplicate, and if it is not, adds it to the queue container (configurable in settings). When the next request is needed, next_request is called: the scheduler takes a request out through the queue container's interface and sends it to the corresponding spider so the spider can crawl it.
7 spider.py
The spider is designed to read URLs to crawl from Redis and then crawl them; if more URLs are produced during the crawl, it keeps going until all requests are finished, then reads more URLs from Redis and repeats the cycle.
Analysis: the spider monitors its own state by connecting to the signals.spider_idle signal. When idle, it turns newly read URLs into requests via make_requests_from_url(url) and returns them to the engine, which passes them on to the scheduler.
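A sketch of such a spider using scrapy-redis's RedisSpider; the redis_key name is illustrative, and URLs are fed to it with a Redis LPUSH (for example: redis-cli lpush myspider:start_urls http://example.com).

```python
# Sketch: a spider that reads its start URLs from a Redis list.
from scrapy_redis.spiders import RedisSpider


class MySpider(RedisSpider):
    name = "myspider"
    redis_key = "myspider:start_urls"   # the list this spider pops URLs from

    def parse(self, response):
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
```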
Reprinted from: https://www.jianshu.com/p/ec3dfaec3c9b?utm_source=tuicool&utm_medium=referral
Python3 Distributed crawler