Python Crawler Summary

Source: Internet
Author: User
Tags: send cookies

[TOC]

For various reasons I have finally managed to set aside the routine work recently and found time to give my earlier crawler knowledge a simple review; stepping back to organize what you have already learned really is worthwhile.

Common third-party libraries

For crawler beginners, it is recommended that, after understanding how a crawler works, you first implement a simple crawler with these common third-party libraries rather than with any crawler framework; doing so deepens your understanding of crawling.

Both urllib and requests are Python HTTP libraries. urllib (together with the urllib2 module in Python 2) provides comprehensive functionality at a considerable cost in complexity, while requests handles the common, simple use cases more cleanly than urllib2. Detailed comparisons of the pros and cons of urllib and requests can be found online.
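
As a quick illustration of the difference in ergonomics, here is a minimal sketch of the same GET request with both libraries (the URL is a placeholder):

```python
import urllib.request

import requests

URL = "https://example.com"  # placeholder URL

# urllib: open the URL, read raw bytes, decode manually
with urllib.request.urlopen(URL) as resp:
    html_urllib = resp.read().decode("utf-8")

# requests: one call; decoding and status handling are built in
resp = requests.get(URL, timeout=10)
resp.raise_for_status()
html_requests = resp.text
```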

Both BeautifulSoup and lxml are Python page-parsing libraries. BeautifulSoup is DOM-based: it loads the whole document and parses the entire DOM tree, so its time and memory overhead is much larger. lxml performs only local traversal and can locate tags quickly with XPath. BeautifulSoup (bs4) is written in Python while lxml is implemented in C, which is also why lxml is faster than bs4.
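
A minimal sketch of parsing the same snippet with both libraries (the HTML and selectors are made up for illustration):

```python
from bs4 import BeautifulSoup
from lxml import etree

html = '<div class="post"><a href="/item/1">First</a><a href="/item/2">Second</a></div>'

# BeautifulSoup: build a tree of Python objects and search it
soup = BeautifulSoup(html, "html.parser")
bs_links = [a["href"] for a in soup.find_all("a")]

# lxml: locate the same tags directly with an XPath expression
tree = etree.HTML(html)
lxml_links = tree.xpath('//div[@class="post"]/a/@href')

print(bs_links)    # ['/item/1', '/item/2']
print(lxml_links)  # ['/item/1', '/item/2']
```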

This blog has a relatively comprehensive collection of common third-party libraries for Python crawlers, which can be used as a reference.

Crawler Frameworks

The two most common Python crawler frameworks are Scrapy and PySpider.
For more information on how to use them, refer to their official documentation.

Dynamic Page Rendering

1. URL Request Analysis

(1) Carefully analyze the structure of the page and see which actions the JavaScript responds to;
(2) Use the browser's developer tools to find the request URL issued by the JavaScript click action;
(3) Crawl the URL of that asynchronous request directly, either as Scrapy's start_urls or by yielding a Request (see the sketch after this list).
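
A minimal Scrapy sketch of step (3), assuming the asynchronous API endpoint has already been identified in the developer tools and that Scrapy 2.2+ is used (so response.json() is available); the URL and field names are placeholders:

```python
import scrapy


class AsyncApiSpider(scrapy.Spider):
    name = "async_api"
    # The URL discovered by analyzing the JS click action (placeholder)
    start_urls = ["https://example.com/api/items?page=1"]

    def parse(self, response):
        data = response.json()
        for item in data.get("items", []):
            yield {"title": item.get("title")}

        # Follow the next page of the same asynchronous endpoint
        next_url = data.get("next_page")
        if next_url:
            yield scrapy.Request(next_url, callback=self.parse)
```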

2. Selenium

Selenium is a web automation testing tool. It was originally developed for automated website testing and works somewhat like the "Key Wizard" macro tools used for games: it performs actions automatically according to specified commands. The difference is that Selenium runs directly in the browser, and it supports all major browsers (including headless browsers such as PhantomJS).

Following our instructions, Selenium can make the browser load pages automatically, fetch the data we need, take screenshots of pages, or check whether certain actions on the site have taken place.

Selenium itself does not come with a browser and does not implement browser functionality; it has to be combined with a third-party browser to be used.
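
A minimal sketch of driving a browser with Selenium, assuming Selenium 4 with Chrome and a matching driver installed (the URL is a placeholder):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Assumes a local Chrome + chromedriver installation (Selenium 4 API)
driver = webdriver.Chrome()
try:
    driver.get("https://example.com")           # let the browser load and render the page
    title = driver.title                        # read data after JavaScript has run
    links = driver.find_elements(By.TAG_NAME, "a")
    print(title, len(links))
    driver.save_screenshot("page.png")          # optional screenshot of the rendered page
finally:
    driver.quit()
```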

3. PhantomJS

When Selenium drives a real browser to crawl pages, opening the browser and rendering each page is inefficient and cannot keep up with large-scale data fetching. In that case we can choose to use PhantomJS.

PhantomJS is a WebKit-based "headless" browser: it loads websites into memory and executes the JavaScript on the page, and because no graphical interface is displayed, it runs more efficiently than a full browser.

If we combine Selenium and PhantomJS, we get a very powerful web crawler that can handle JavaScript, cookies, headers, and everything else a real user would do.
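
A minimal sketch of that combination. Note this assumes an older Selenium release (3.x or earlier) that still ships the PhantomJS driver and that the phantomjs binary is on the PATH; newer Selenium versions have dropped PhantomJS support:

```python
from selenium import webdriver

# Assumes Selenium <= 3.x (PhantomJS driver still included) and `phantomjs` on the PATH
driver = webdriver.PhantomJS()
try:
    driver.get("https://example.com")   # the page is rendered in memory, no window appears
    html = driver.page_source           # HTML after JavaScript execution
    print(len(html))
finally:
    driver.quit()
```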

4. Splash

Splash is a JavaScript rendering service. It is a lightweight browser that exposes an HTTP API. Splash is implemented in Python on top of Twisted and Qt; Twisted and Qt give the service asynchronous processing capability, so WebKit rendering can run concurrently.

The Python library that connects Scrapy to Splash is called scrapy-splash. Because scrapy-splash uses the Splash HTTP API, a running Splash instance is required; Splash is usually run with Docker, so Docker needs to be installed.
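
A minimal scrapy-splash sketch, assuming a Splash instance is already running locally on port 8050 (the default of the official Docker image) and using the settings recommended in the scrapy-splash README:

```python
import scrapy
from scrapy_splash import SplashRequest


class SplashSpider(scrapy.Spider):
    name = "splash_demo"

    # Minimal scrapy-splash configuration; assumes Splash at localhost:8050
    custom_settings = {
        "SPLASH_URL": "http://localhost:8050",
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_splash.SplashCookiesMiddleware": 723,
            "scrapy_splash.SplashMiddleware": 725,
            "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
        },
        "SPIDER_MIDDLEWARES": {
            "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
        },
        "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
    }

    def start_requests(self):
        # Render the page in Splash and wait briefly for JavaScript to finish
        yield SplashRequest("https://example.com", self.parse, args={"wait": 1})

    def parse(self, response):
        yield {"title": response.css("title::text").get()}
```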

5. Spynner

Spynner is a QtWebKit client that simulates a browser: it can load pages, trigger events, fill in forms, and so on.

Crawler Anti-Blocking Strategies

1. Modify the User-Agent

Modifying the User-Agent is one of the most common ways of disguising a crawler as a browser.

The User-Agent is a string, sent as an HTTP request header, containing information about the browser, the operating system, and so on. The server uses it to decide whether the current client is a browser, a mail client, or a web crawler. You can view the User-Agent in request.headers; how to capture a request and inspect its User-Agent was covered in a previous article.

Concretely, you can set the User-Agent to the value a real browser would send, or even maintain a User-Agent pool (a list, array, or dictionary will do) holding several "browsers" and randomly pick one to set on each request, so the User-Agent keeps changing and the crawler is less likely to be blocked. A minimal sketch follows.
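
The sketch below keeps a small User-Agent pool and picks a random entry per request (the URL is a placeholder; a real pool would be much longer):

```python
import random

import requests

# A small pool of User-Agent strings; in practice you would keep a longer list
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]

def fetch(url):
    # Pick a random "browser" identity for every request
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)

resp = fetch("https://example.com")  # placeholder URL
print(resp.status_code)
```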

2. Disable Cookies

Cookies are pieces of (often encrypted) data stored on the user's side; some websites use cookies to identify users. If requests tied to one visitor arrive at a consistently high frequency, the site is likely to notice, suspect a crawler, and then use the cookie to track down that visitor and deny them access.

Disabling cookies means the client actively refuses to let the server write them. Disabling cookies keeps us from being banned by websites that may use cookies to identify crawlers.

In a Scrapy crawler you can set COOKIES_ENABLED = False, i.e. disable the cookies middleware so that no cookies are sent to the web server, as sketched below.
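
A minimal sketch of the Scrapy setting, either project-wide in settings.py or per spider via custom_settings (the spider and URL are placeholders):

```python
# settings.py of a Scrapy project: disable the cookies middleware globally
COOKIES_ENABLED = False
```

```python
import scrapy


class NoCookieSpider(scrapy.Spider):
    name = "no_cookies"
    start_urls = ["https://example.com"]  # placeholder URL
    # Disable cookies for this spider only
    custom_settings = {"COOKIES_ENABLED": False}

    def parse(self, response):
        yield {"title": response.css("title::text").get()}
```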

3. Set the request time interval

Large-scale, concentrated access puts a heavy load on servers; a crawler can noticeably increase server load in a short time. Note: control the range of the download wait time. If the wait is too long, you cannot crawl large amounts of data quickly; if it is too short, you are likely to be denied access.

Setting a reasonable request interval keeps the crawler reasonably efficient while avoiding a heavy impact on the other side's server, for example as sketched below.
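
A minimal sketch using a random delay between requests (the URLs are placeholders). In Scrapy, the equivalent knobs are DOWNLOAD_DELAY together with RANDOMIZE_DOWNLOAD_DELAY:

```python
import random
import time

import requests

urls = [f"https://example.com/page/{i}" for i in range(1, 6)]  # placeholder URLs

for url in urls:
    resp = requests.get(url, timeout=10)
    print(url, resp.status_code)
    # Sleep for a random interval so requests are neither too fast nor too regular
    time.sleep(random.uniform(1.0, 3.0))
```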

4. Proxy IP Pool

In practice, what such sites ban is the IP, not the account. In other words, when you need to fetch a lot of data continuously, simulating a login is of limited use: from the same IP, switching accounts does not help; the key is to change the IP.

One of the web server's strategies against crawlers is to block an IP, or even an entire IP range, from accessing the site. When an IP is banned, you can switch to another IP and continue crawling. Methods: proxy IPs, or a local IP database (an IP pool), as sketched below.
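
A minimal sketch of routing requests through a small proxy pool (the proxy addresses and URL are placeholders):

```python
import random

import requests

# A small pool of proxies; the addresses are placeholders
PROXY_POOL = [
    "http://111.111.111.111:8080",
    "http://122.122.122.122:3128",
    "http://133.133.133.133:8888",
]

def fetch_via_proxy(url):
    proxy = random.choice(PROXY_POOL)
    # Route the request through a randomly chosen proxy IP
    proxies = {"http": proxy, "https": proxy}
    return requests.get(url, proxies=proxies, timeout=10)

resp = fetch_via_proxy("https://example.com")  # placeholder URL
print(resp.status_code)
```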

5. Using Selenium

Using Selenium to simulate manual clicks when visiting a site is a very effective way to avoid being banned. However, Selenium is inefficient and not suited to large-scale data capture.

6. Crack the Verification Code

Verification codes (CAPTCHAs) are now one of the most common means of blocking crawlers. Capable readers can write their own recognition algorithms, but in general it is easier to spend a little money on a third-party captcha-solving platform's API to get the code recognized.

Conclusion

The above is a brief review of Python crawling; the details of each specific technique need to be looked up separately. I hope it is of some help to those learning to write crawlers.
