Python3 Crawler (16): The pyspider Framework



Infi-chu:

http://www.cnblogs.com/Infi-chu/

I. Introduction to pyspider
1. Basic functions
Provides a WebUI for visually writing and debugging crawlers
Provides crawl progress monitoring, crawl result viewing, and crawler project management
Supports multiple database backends: MySQL, MongoDB, Redis, SQLite, PostgreSQL, etc.
Supports multiple message queues: RabbitMQ, Beanstalk, Redis, etc.
Provides priority control, retry on failure, scheduled crawls, etc.
Integrates with PhantomJS to crawl JavaScript-rendered pages
Supports single-machine, distributed, and Docker deployments

2. pyspider vs. Scrapy
pyspider provides a WebUI; Scrapy has no such feature natively
pyspider is easy to debug thanks to the WebUI
pyspider supports PhantomJS; Scrapy supports the scrapy-splash component
pyspider has pyquery built in as its selector; Scrapy interfaces with XPath, CSS selectors, and regular expressions
pyspider offers a lower degree of extensibility than Scrapy

3. Framework design
Three main modules: the Scheduler, the Fetcher, and the Processor

4. Workflow
1. Each pyspider project is a single Python script that defines a Handler class; the on_start() method is the entry point that starts the project, and the initial task is sent to the Scheduler for scheduling
2. The Scheduler passes the crawl task to the Fetcher; the Fetcher fetches the response and passes it to the Processor
3. The Processor parses the response and extracts new URLs, which are passed back to the Scheduler through the message queue; if new crawl results are produced, they are sent to the result queue, where the Result Worker processes them
4. This loop repeats until the crawl ends, at which point on_finished() is called
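The following is a minimal sketch of a handler that follows this flow; the URL and the pyquery selectors are placeholders, not values from the original project:

    from pyspider.libs.base_handler import *

    class Handler(BaseHandler):
        crawl_config = {}

        @every(minutes=24 * 60)         # re-run on_start once a day
        def on_start(self):
            # entry point: seed the Scheduler with the start URL
            self.crawl('http://example.com/', callback=self.index_page)

        @config(age=10 * 24 * 60 * 60)  # task valid for 10 days
        def index_page(self, response):
            # extract new URLs and hand them back to the Scheduler
            for each in response.doc('a[href^="http"]').items():
                self.crawl(each.attr.href, callback=self.detail_page)

        def detail_page(self, response):
            # anything returned here goes to the result queue
            return {
                'url': response.url,
                'title': response.doc('title').text(),
            }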

5. Example
https://github.com/Infi-chu/quna

II. pyspider in Detail
1. Startup:
Run the following command to start all of pyspider's components (Scheduler, Fetcher, Processor, and the WebUI):

    pyspider all

By default the WebUI is then available at http://localhost:5000.
2. The crawl() method and its parameters
url: the URL to crawl; either a single URL string or a list of URLs
callback: the callback function that parses the response for this URL
age: the validity period of the task; within this period the page is treated as unchanged and is not re-crawled
priority: scheduling priority, default 0; larger values mean higher priority
exetime: schedules the task at a given timestamp; the default 0 means execute immediately
retries: retry count, default 3
itag: a marker value taken from the page (e.g. a node's content), compared across crawls to decide whether the page has changed
auto_recrawl: when enabled, the task is re-crawled automatically after its age expires
method: HTTP request method
params: GET query parameters, as a dictionary
data: POST form data
files: files to upload, specified with their field and file names
user_agent: the User-Agent for the request
headers: request headers
cookies: cookies, as a dictionary
connect_timeout: maximum wait when establishing a connection, default 20 seconds
timeout: maximum time allowed to fetch a page, default 120 seconds
allow_redirects: whether redirects are followed automatically, default True
validate_cert: whether to validate the HTTPS certificate, default True
proxy: the proxy to use
fetch_type: set to 'js' to enable PhantomJS rendering
js_script: a JavaScript snippet executed after the page finishes loading
js_run_at: where the script runs, at document-start or document-end (the default)
js_viewport_width/js_viewport_height: the viewport size of the page while JavaScript renders it
load_images: whether to load images, default False
save: carries parameters between different methods (available as response.save in the callback)
cancel: cancels a task
force_update: forces an update of the task status
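A sketch showing several of these parameters in combination, inside a Handler class like the one above; the URLs, form fields, and proxy address are illustrative assumptions, not values from the original post:

    def on_start(self):
        # GET with query parameters, higher priority, and more retries
        self.crawl('http://example.com/list', callback=self.index_page,
                   params={'page': 1}, priority=2, retries=5)

        # POST with form data, sent through a (hypothetical) local proxy
        self.crawl('http://example.com/login', callback=self.index_page,
                   method='POST', data={'user': 'name', 'pass': 'secret'},
                   proxy='127.0.0.1:8080')

        # JavaScript-rendered page via PhantomJS, passing extra data along
        self.crawl('http://example.com/js-page', callback=self.detail_page,
                   fetch_type='js', save={'category': 'books'})

    def detail_page(self, response):
        # the dictionary given to save above is available as response.save
        return {'url': response.url, 'category': response.save.get('category')}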
3. Task deduplication:
pyspider uses the MD5 hash of the URL as the task ID; two tasks with the same MD5 are treated as the same task
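Because only the URL is hashed by default, the pyspider documentation shows overriding get_taskid() so that, for example, POST requests to the same URL with different payloads count as distinct tasks:

    import json
    from pyspider.libs.utils import md5string

    # inside a Handler class
    def get_taskid(self, task):
        # include the POST body in the hash, not just the URL
        return md5string(task['url'] + json.dumps(task['fetch'].get('data', '')))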
4. Global configuration:
Parameters placed in the crawl_config class attribute apply to every crawl() call in the project
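A minimal sketch; the header value and proxy address are assumptions for illustration:

    class Handler(BaseHandler):
        # applied globally to every crawl() call in this project
        crawl_config = {
            'headers': {'User-Agent': 'MyCrawler/1.0'},  # hypothetical UA
            'proxy': '127.0.0.1:8080',                   # hypothetical proxy
            'timeout': 60,
        }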
5. Scheduled crawls
Use the every decorator to set the time interval between crawls
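For instance (the URL is a placeholder); note that the task's age should be shorter than the every interval, otherwise the task is still considered valid when the schedule fires and the re-crawl is skipped:

    # inside a Handler class
    @every(minutes=60)  # schedule on_start every hour
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page,
                   age=30 * 60)  # expires before the next scheduled run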
6. Project status:
TODO: just created, not yet run
STOP: stopped
CHECKING: a running project has been modified
DEBUG/RUNNING: running (DEBUG is intended for testing; both states execute the crawler)
PAUSE: paused automatically after multiple consecutive errors
7. Deleting a project
Set the project status to STOP and change its group name to delete; the project is then deleted automatically after 24 hours

