Python3 Crawler (16): The pyspider Framework



Infi-chu:

http://www.cnblogs.com/Infi-chu/

I. Introduction to pyspider
1. Basic functions
Provides a WebUI for visually writing and debugging crawlers
Provides crawl progress monitoring, crawl result viewing, and crawler project management
Supports multiple database backends: MySQL, MongoDB, Redis, SQLite, PostgreSQL, etc.
Supports multiple message queues: RabbitMQ, Beanstalk, Redis, etc.
Provides priority control, retry on failure, scheduled crawls, etc.
Integrates with PhantomJS to crawl JavaScript-rendered pages
Supports single-machine, distributed, and Docker deployments

2. pyspider vs. Scrapy
pyspider provides a WebUI; Scrapy has no such feature natively
pyspider is easy to debug thanks to the WebUI
pyspider supports PhantomJS; Scrapy supports the scrapy-splash component
pyspider has pyquery built in as its selector; Scrapy interfaces with XPath, CSS selectors, and regular expressions
pyspider offers a lower degree of extensibility than Scrapy

3. Framework design
Three main modules: the Scheduler, the Fetcher, and the Processor

4. Workflow
1. Each pyspider project is a single Python script that defines a Handler class; the on_start() method is the entry point that starts the project, and the initial task is sent to the Scheduler for scheduling
2. The Scheduler passes the crawl task to the Fetcher; the Fetcher fetches the response and passes it to the Processor
3. The Processor parses the response and extracts new URLs, which are passed back to the Scheduler through the message queue; if new crawl results are produced, they are sent to the result queue, where the Result Worker processes them
4. This loop repeats until the crawl ends, at which point on_finished() is called
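The following is a minimal sketch of a handler that follows this flow; the URL and the pyquery selectors are placeholders, not values from the original project:

    from pyspider.libs.base_handler import *

    class Handler(BaseHandler):
        crawl_config = {}

        @every(minutes=24 * 60)         # re-run on_start once a day
        def on_start(self):
            # entry point: seed the Scheduler with the start URL
            self.crawl('http://example.com/', callback=self.index_page)

        @config(age=10 * 24 * 60 * 60)  # task valid for 10 days
        def index_page(self, response):
            # extract new URLs and hand them back to the Scheduler
            for each in response.doc('a[href^="http"]').items():
                self.crawl(each.attr.href, callback=self.detail_page)

        def detail_page(self, response):
            # anything returned here goes to the result queue
            return {
                'url': response.url,
                'title': response.doc('title').text(),
            }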

5. Example
https://github.com/Infi-chu/quna

II. pyspider in Detail
1. Startup:
Run the following command to start all of pyspider's components (Scheduler, Fetcher, Processor, and the WebUI):

    pyspider all

By default the WebUI is then available at http://localhost:5000.
2. The crawl() method and its parameters
url: the URL to crawl; either a single URL string or a list of URLs
callback: the callback function that parses the response for this URL
age: the validity period of the task; within this period the page is treated as unchanged and is not re-crawled
priority: scheduling priority, default 0; larger values mean higher priority
exetime: schedules the task at a given timestamp; the default 0 means execute immediately
retries: retry count, default 3
itag: a marker value taken from the page (e.g. a node's content), compared across crawls to decide whether the page has changed
auto_recrawl: when enabled, the task is re-crawled automatically after its age expires
method: HTTP request method
params: GET query parameters, as a dictionary
data: POST form data
files: files to upload, specified with their field and file names
user_agent: the User-Agent for the request
headers: request headers
cookies: cookies, as a dictionary
connect_timeout: maximum wait when establishing a connection, default 20 seconds
timeout: maximum time allowed to fetch a page, default 120 seconds
allow_redirects: whether redirects are followed automatically, default True
validate_cert: whether to validate the HTTPS certificate, default True
proxy: the proxy to use
fetch_type: set to 'js' to enable PhantomJS rendering
js_script: a JavaScript snippet executed after the page finishes loading
js_run_at: where the script runs, at document-start or document-end (the default)
js_viewport_width/js_viewport_height: the viewport size of the page while JavaScript renders it
load_images: whether to load images, default False
save: carries parameters between different methods (available as response.save in the callback)
cancel: cancels a task
force_update: forces an update of the task status
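A sketch showing several of these parameters in combination, inside a Handler class like the one above; the URLs, form fields, and proxy address are illustrative assumptions, not values from the original post:

    def on_start(self):
        # GET with query parameters, higher priority, and more retries
        self.crawl('http://example.com/list', callback=self.index_page,
                   params={'page': 1}, priority=2, retries=5)

        # POST with form data, sent through a (hypothetical) local proxy
        self.crawl('http://example.com/login', callback=self.index_page,
                   method='POST', data={'user': 'name', 'pass': 'secret'},
                   proxy='127.0.0.1:8080')

        # JavaScript-rendered page via PhantomJS, passing extra data along
        self.crawl('http://example.com/js-page', callback=self.detail_page,
                   fetch_type='js', save={'category': 'books'})

    def detail_page(self, response):
        # the dictionary given to save above is available as response.save
        return {'url': response.url, 'category': response.save.get('category')}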
3. Task deduplication:
pyspider uses the MD5 hash of the URL as the task ID; two tasks with the same MD5 are treated as the same task
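Because only the URL is hashed by default, the pyspider documentation shows overriding get_taskid() so that, for example, POST requests to the same URL with different payloads count as distinct tasks:

    import json
    from pyspider.libs.utils import md5string

    # inside a Handler class
    def get_taskid(self, task):
        # include the POST body in the hash, not just the URL
        return md5string(task['url'] + json.dumps(task['fetch'].get('data', '')))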
4. Global configuration:
Parameters placed in the crawl_config class attribute apply to every crawl() call in the project
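A minimal sketch; the header value and proxy address are assumptions for illustration:

    class Handler(BaseHandler):
        # applied globally to every crawl() call in this project
        crawl_config = {
            'headers': {'User-Agent': 'MyCrawler/1.0'},  # hypothetical UA
            'proxy': '127.0.0.1:8080',                   # hypothetical proxy
            'timeout': 60,
        }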
5. Scheduled crawls
Use the every decorator to set the time interval between crawls
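For instance (the URL is a placeholder); note that the task's age should be shorter than the every interval, otherwise the task is still considered valid when the schedule fires and the re-crawl is skipped:

    # inside a Handler class
    @every(minutes=60)  # schedule on_start every hour
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page,
                   age=30 * 60)  # expires before the next scheduled run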
6. Project status:
TODO: just created, not yet run
STOP: stopped
CHECKING: a running project has been modified
DEBUG/RUNNING: running (DEBUG is intended for testing; both states execute the crawler)
PAUSE: paused automatically after multiple consecutive errors
7. Deleting a project
Set the project status to STOP and change its group name to delete; the project is then deleted automatically after 24 hours

