Resolving Web Sites' Anti-Crawler Strategies with Python

Source: Internet
Author: User

Websites' anti-crawler strategies:

In terms of function, crawlers are generally divided into three parts: data collection, processing, and storage. Here we only discuss the data collection part.

Websites generally defend against crawlers from three angles: user request headers, user behavior, and the site's directory and data-loading method. The first two are relatively easy to handle, and most websites build their anti-crawler defenses from these angles. The third is used by AJAX-based websites, which increases the difficulty of crawling (dynamically loading pages with AJAX defeats static crawlers).

1. Anti-crawler based on user request headers. This is the most common anti-crawler strategy.

Disguise the headers. Many sites check the User-Agent in the request headers, and some also check the Referer (some resource sites implement hotlink protection by checking the Referer). If you encounter this kind of anti-crawler mechanism, you can add headers directly to the crawler: copy the browser's User-Agent into the crawler's headers, or set the Referer to the target site's domain. [Comment: this is easy to overlook; analyze the request packets to determine the right Referer, then add it to the impersonated request headers in the program.] Anti-crawler checks on headers are easily bypassed by modifying or adding headers in the crawler.
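A minimal sketch of this with the requests library (the URL is a placeholder, and the User-Agent is one copied from a browser):

import requests

# Placeholder target URL
url = "https://www.example.com/page"

headers = {
    # User-Agent copied from a real browser
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
    # Referer set to the target site's own domain to pass hotlink checks
    "Referer": "https://www.example.com/",
}

response = requests.get(url, headers=headers, timeout=10)
print(response.status_code)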

2. Anti-crawler based on user behavior

Some sites detect user behavior instead, such as the same IP visiting the same page many times within a short period, or the same account performing the same operation many times within a short period. [Comment: this kind of anti-crawling requires a large enough pool of IPs to deal with.]

(1) Most sites fall into the former case, and using IP proxies resolves it. You can write a dedicated crawler that scrapes publicly available proxy IPs online, verifies them, and saves them all. With a large number of proxy IPs, you can switch to a new IP every few requests, which is easy to do with requests or urllib2, so the first kind of anti-crawler is easily bypassed.

Code to crawl proxy IPs:
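A minimal sketch, assuming a hypothetical proxy-list page (PROXY_LIST_URL is a placeholder, not a real service) and using httpbin.org to verify each proxy:

import re
import requests

PROXY_LIST_URL = "https://www.example-proxy-list.com/free"  # hypothetical page
TEST_URL = "https://httpbin.org/ip"  # echoes the requesting IP

def crawl_proxies():
    """Scrape ip:port pairs from the (hypothetical) public proxy list page."""
    html = requests.get(PROXY_LIST_URL, timeout=10).text
    # Match patterns like 123.45.67.89:8080 anywhere in the page
    return re.findall(r"\d{1,3}(?:\.\d{1,3}){3}:\d{2,5}", html)

def verify(proxy):
    """Return True if the proxy answers a test request within 5 seconds."""
    proxies = {"http": "http://" + proxy, "https": "http://" + proxy}
    try:
        r = requests.get(TEST_URL, proxies=proxies, timeout=5)
        return r.ok
    except requests.RequestException:
        return False

if __name__ == "__main__":
    good = [p for p in crawl_proxies() if verify(p)]
    print("%d working proxies" % len(good))

With the verified pool saved, each call to requests.get can pass the proxies argument and rotate to a new entry every few requests.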

(2) For the second case, you can wait a random interval of a few seconds after each request before making the next one, as sketched below. On sites with logical vulnerabilities, you can also bypass the restriction that the same account cannot make the same request multiple times within a short period by requesting a few times, logging out, logging back in, and continuing to request. [Comment: anti-crawling restrictions tied to accounts are generally hard to deal with; even requests at random intervals of a few seconds may often be blocked. If you can use multiple accounts and switch between them, the effect is better.]
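A minimal sketch of the random-interval approach (the URLs are placeholders):

import random
import time
import requests

# Placeholder URLs for the pages to crawl
urls = ["https://www.example.com/page/%d" % i for i in range(1, 6)]

for url in urls:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    # Sleep a random 2 to 6 seconds so the request rhythm looks less mechanical
    time.sleep(random.uniform(2, 6))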

3. Anti-crawler on dynamic pages

The cases above mostly involve static pages. On some sites, the data we need to crawl is obtained through AJAX requests or generated by JavaScript.
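When the data comes from an AJAX request, one option is to find the underlying endpoint in the browser's developer tools (the XHR tab) and call it directly. A minimal sketch, where the endpoint, its query parameters, and the assumption that it returns JSON are all hypothetical:

import requests

# Hypothetical JSON endpoint discovered by watching XHR traffic in dev tools
api_url = "https://www.example.com/api/list"
params = {"page": 1, "size": 20}  # assumed query parameters

headers = {
    "User-Agent": "Mozilla/5.0",
    # Many AJAX endpoints also expect this header
    "X-Requested-With": "XMLHttpRequest",
}

resp = requests.get(api_url, params=params, headers=headers, timeout=10)
data = resp.json()  # assumes the endpoint returns JSON
print(data)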
