Python Crawler Summary

Source: Internet
Author: User
Tags: send cookies

[TOC]

For various reasons I have finally managed to set aside the routine work recently and found time to give my earlier crawler knowledge a simple review; stepping back to organize what you have already learned really is worthwhile.

Common third-party libraries

For crawler beginners, it is recommended that, after understanding how a crawler works, you first implement a simple crawler with these common third-party libraries rather than with any crawler framework; doing so deepens your understanding of crawling.

Both urllib and requests are Python HTTP libraries. urllib (together with the urllib2 module in Python 2) provides comprehensive functionality at a considerable cost in complexity, while requests handles the common, simple use cases more cleanly than urllib2. Detailed comparisons of the pros and cons of urllib and requests can be found online.
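
As a quick illustration of the difference in ergonomics, here is a minimal sketch of the same GET request with both libraries (the URL is a placeholder):

```python
import urllib.request

import requests

URL = "https://example.com"  # placeholder URL

# urllib: open the URL, read raw bytes, decode manually
with urllib.request.urlopen(URL) as resp:
    html_urllib = resp.read().decode("utf-8")

# requests: one call; decoding and status handling are built in
resp = requests.get(URL, timeout=10)
resp.raise_for_status()
html_requests = resp.text
```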

Both BeautifulSoup and lxml are Python page-parsing libraries. BeautifulSoup is DOM-based: it loads the whole document and parses the entire DOM tree, so its time and memory overhead is much larger. lxml performs only local traversal and can locate tags quickly with XPath. BeautifulSoup (bs4) is written in Python while lxml is implemented in C, which is also why lxml is faster than bs4.
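
A minimal sketch of parsing the same snippet with both libraries (the HTML and selectors are made up for illustration):

```python
from bs4 import BeautifulSoup
from lxml import etree

html = '<div class="post"><a href="/item/1">First</a><a href="/item/2">Second</a></div>'

# BeautifulSoup: build a tree of Python objects and search it
soup = BeautifulSoup(html, "html.parser")
bs_links = [a["href"] for a in soup.find_all("a")]

# lxml: locate the same tags directly with an XPath expression
tree = etree.HTML(html)
lxml_links = tree.xpath('//div[@class="post"]/a/@href')

print(bs_links)    # ['/item/1', '/item/2']
print(lxml_links)  # ['/item/1', '/item/2']
```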

This blog has a relatively comprehensive collection of common third-party libraries for Python crawlers, which can be used as a reference.

Crawler Frameworks

The two most common Python crawler frameworks are Scrapy and PySpider.
For more information on how to use them, refer to their official documentation.

Dynamic Page Rendering

1. URL Request Analysis

(1) Carefully analyze the structure of the page and see which actions the JavaScript responds to;
(2) Use the browser's developer tools to find the request URL issued by the JavaScript click action;
(3) Crawl the URL of that asynchronous request directly, either as Scrapy's start_urls or by yielding a Request (see the sketch after this list).
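
A minimal Scrapy sketch of step (3), assuming the asynchronous API endpoint has already been identified in the developer tools and that Scrapy 2.2+ is used (so response.json() is available); the URL and field names are placeholders:

```python
import scrapy


class AsyncApiSpider(scrapy.Spider):
    name = "async_api"
    # The URL discovered by analyzing the JS click action (placeholder)
    start_urls = ["https://example.com/api/items?page=1"]

    def parse(self, response):
        data = response.json()
        for item in data.get("items", []):
            yield {"title": item.get("title")}

        # Follow the next page of the same asynchronous endpoint
        next_url = data.get("next_page")
        if next_url:
            yield scrapy.Request(next_url, callback=self.parse)
```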

2. Selenium

Selenium is a web automation testing tool. It was originally developed for automated website testing and works somewhat like the "Key Wizard" macro tools used for games: it performs actions automatically according to specified commands. The difference is that Selenium runs directly in the browser, and it supports all major browsers (including headless browsers such as PhantomJS).

Following our instructions, Selenium can make the browser load pages automatically, fetch the data we need, take screenshots of pages, or check whether certain actions on the site have taken place.

Selenium itself does not come with a browser and does not implement browser functionality; it has to be combined with a third-party browser to be used.
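
A minimal sketch of driving a browser with Selenium, assuming Selenium 4 with Chrome and a matching driver installed (the URL is a placeholder):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Assumes a local Chrome + chromedriver installation (Selenium 4 API)
driver = webdriver.Chrome()
try:
    driver.get("https://example.com")           # let the browser load and render the page
    title = driver.title                        # read data after JavaScript has run
    links = driver.find_elements(By.TAG_NAME, "a")
    print(title, len(links))
    driver.save_screenshot("page.png")          # optional screenshot of the rendered page
finally:
    driver.quit()
```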

3. PhantomJS

When Selenium drives a real browser to crawl pages, opening the browser and rendering each page is inefficient and cannot keep up with large-scale data fetching. In that case we can choose to use PhantomJS.

PhantomJS is a WebKit-based "headless" browser: it loads websites into memory and executes the JavaScript on the page, and because no graphical interface is displayed, it runs more efficiently than a full browser.

If we combine Selenium and PhantomJS, we get a very powerful web crawler that can handle JavaScript, cookies, headers, and everything else a real user would do.
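
A minimal sketch of that combination. Note this assumes an older Selenium release (3.x or earlier) that still ships the PhantomJS driver and that the phantomjs binary is on the PATH; newer Selenium versions have dropped PhantomJS support:

```python
from selenium import webdriver

# Assumes Selenium <= 3.x (PhantomJS driver still included) and `phantomjs` on the PATH
driver = webdriver.PhantomJS()
try:
    driver.get("https://example.com")   # the page is rendered in memory, no window appears
    html = driver.page_source           # HTML after JavaScript execution
    print(len(html))
finally:
    driver.quit()
```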

4. Splash

Splash is a JavaScript rendering service. It is a lightweight browser that exposes an HTTP API. Splash is implemented in Python on top of Twisted and Qt; Twisted and Qt give the service asynchronous processing capability, so WebKit rendering can run concurrently.

The Python library that connects Scrapy to Splash is called scrapy-splash. Because scrapy-splash uses the Splash HTTP API, a running Splash instance is required; Splash is usually run with Docker, so Docker needs to be installed.
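
A minimal scrapy-splash sketch, assuming a Splash instance is already running locally on port 8050 (the default of the official Docker image) and using the settings recommended in the scrapy-splash README:

```python
import scrapy
from scrapy_splash import SplashRequest


class SplashSpider(scrapy.Spider):
    name = "splash_demo"

    # Minimal scrapy-splash configuration; assumes Splash at localhost:8050
    custom_settings = {
        "SPLASH_URL": "http://localhost:8050",
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_splash.SplashCookiesMiddleware": 723,
            "scrapy_splash.SplashMiddleware": 725,
            "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
        },
        "SPIDER_MIDDLEWARES": {
            "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
        },
        "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
    }

    def start_requests(self):
        # Render the page in Splash and wait briefly for JavaScript to finish
        yield SplashRequest("https://example.com", self.parse, args={"wait": 1})

    def parse(self, response):
        yield {"title": response.css("title::text").get()}
```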

5. Spynner

Spynner is a QtWebKit client that simulates a browser: it can load pages, trigger events, fill in forms, and so on.

Crawler Anti-Blocking Strategies

1. Modify the User-Agent

Modifying the User-Agent is one of the most common ways of disguising a crawler as a browser.

The User-Agent is a string, sent as an HTTP request header, containing information about the browser, the operating system, and so on. The server uses it to decide whether the current client is a browser, a mail client, or a web crawler. You can view the User-Agent in request.headers; how to capture a request and inspect its User-Agent was covered in a previous article.

Concretely, you can set the User-Agent to the value a real browser would send, or even maintain a User-Agent pool (a list, array, or dictionary will do) holding several "browsers" and randomly pick one to set on each request, so the User-Agent keeps changing and the crawler is less likely to be blocked. A minimal sketch follows.
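
The sketch below keeps a small User-Agent pool and picks a random entry per request (the URL is a placeholder; a real pool would be much longer):

```python
import random

import requests

# A small pool of User-Agent strings; in practice you would keep a longer list
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]

def fetch(url):
    # Pick a random "browser" identity for every request
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)

resp = fetch("https://example.com")  # placeholder URL
print(resp.status_code)
```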

2. Disable Cookies

Cookies are pieces of (often encrypted) data stored on the user's side; some websites use cookies to identify users. If requests tied to one visitor arrive at a consistently high frequency, the site is likely to notice, suspect a crawler, and then use the cookie to track down that visitor and deny them access.

Disabling cookies means the client actively refuses to let the server write them. Disabling cookies keeps us from being banned by websites that may use cookies to identify crawlers.

In a Scrapy crawler you can set COOKIES_ENABLED = False, i.e. disable the cookies middleware so that no cookies are sent to the web server, as sketched below.
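
A minimal sketch of the Scrapy setting, either project-wide in settings.py or per spider via custom_settings (the spider and URL are placeholders):

```python
# settings.py of a Scrapy project: disable the cookies middleware globally
COOKIES_ENABLED = False
```

```python
import scrapy


class NoCookieSpider(scrapy.Spider):
    name = "no_cookies"
    start_urls = ["https://example.com"]  # placeholder URL
    # Disable cookies for this spider only
    custom_settings = {"COOKIES_ENABLED": False}

    def parse(self, response):
        yield {"title": response.css("title::text").get()}
```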

3. Set the request time interval

Large-scale, concentrated access puts a heavy load on servers; a crawler can noticeably increase server load in a short time. Note: control the range of the download wait time. If the wait is too long, you cannot crawl large amounts of data quickly; if it is too short, you are likely to be denied access.

Setting a reasonable request interval keeps the crawler reasonably efficient while avoiding a heavy impact on the other side's server, for example as sketched below.
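
A minimal sketch using a random delay between requests (the URLs are placeholders). In Scrapy, the equivalent knobs are DOWNLOAD_DELAY together with RANDOMIZE_DOWNLOAD_DELAY:

```python
import random
import time

import requests

urls = [f"https://example.com/page/{i}" for i in range(1, 6)]  # placeholder URLs

for url in urls:
    resp = requests.get(url, timeout=10)
    print(url, resp.status_code)
    # Sleep for a random interval so requests are neither too fast nor too regular
    time.sleep(random.uniform(1.0, 3.0))
```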

4. Proxy IP Pool

In practice, what such sites ban is the IP, not the account. In other words, when you need to fetch a lot of data continuously, simulating a login is of limited use: from the same IP, switching accounts does not help; the key is to change the IP.

One of the web server's strategies against crawlers is to block an IP, or even an entire IP range, from accessing the site. When an IP is banned, you can switch to another IP and continue crawling. Methods: proxy IPs, or a local IP database (an IP pool), as sketched below.
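
A minimal sketch of routing requests through a small proxy pool (the proxy addresses and URL are placeholders):

```python
import random

import requests

# A small pool of proxies; the addresses are placeholders
PROXY_POOL = [
    "http://111.111.111.111:8080",
    "http://122.122.122.122:3128",
    "http://133.133.133.133:8888",
]

def fetch_via_proxy(url):
    proxy = random.choice(PROXY_POOL)
    # Route the request through a randomly chosen proxy IP
    proxies = {"http": proxy, "https": proxy}
    return requests.get(url, proxies=proxies, timeout=10)

resp = fetch_via_proxy("https://example.com")  # placeholder URL
print(resp.status_code)
```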

5. Using Selenium

Using Selenium to simulate manual clicks when visiting a site is a very effective way to avoid being banned. However, Selenium is inefficient and not suited to large-scale data capture.

6. Crack the Verification Code

Verification codes (CAPTCHAs) are now one of the most common means of blocking crawlers. Capable readers can write their own recognition algorithms, but in general it is easier to spend a little money on a third-party captcha-solving platform's API to get the code recognized.

Conclusion

The above is a brief review of Python crawling; the details of each specific technique need to be looked up separately. I hope it is of some help to those learning to write crawlers.
