Resolving Web Sites' Anti-Crawler Strategies with Python

Source: Internet
Author: User

Websites' anti-crawler strategies:

In terms of function, crawlers are generally divided into three parts: data collection, processing, and storage. Here we only discuss the data collection part.

Websites generally defend against crawlers from three angles: user request headers, user behavior, and the site's directory and data-loading method. The first two are relatively easy to handle, and most websites build their anti-crawler defenses from these angles. The third is used by AJAX-based websites, which increases the difficulty of crawling (dynamically loading pages with AJAX defeats static crawlers).

1. Anti-crawler based on user request headers. This is the most common anti-crawler strategy.

Disguise the headers. Many sites check the User-Agent in the request headers, and some also check the Referer (some resource sites implement hotlink protection by checking the Referer). If you encounter this kind of anti-crawler mechanism, you can add headers directly to the crawler: copy the browser's User-Agent into the crawler's headers, or set the Referer to the target site's domain. [Comment: this is easy to overlook; analyze the request packets to determine the right Referer, then add it to the impersonated request headers in the program.] Anti-crawler checks on headers are easily bypassed by modifying or adding headers in the crawler.
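A minimal sketch of this with the requests library (the URL is a placeholder, and the User-Agent is one copied from a browser):

import requests

# Placeholder target URL
url = "https://www.example.com/page"

headers = {
    # User-Agent copied from a real browser
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
    # Referer set to the target site's own domain to pass hotlink checks
    "Referer": "https://www.example.com/",
}

response = requests.get(url, headers=headers, timeout=10)
print(response.status_code)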

2. Anti-crawler based on user behavior

Some sites detect user behavior instead, such as the same IP visiting the same page many times within a short period, or the same account performing the same operation many times within a short period. [Comment: this kind of anti-crawling requires a large enough pool of IPs to deal with.]

(1) Most sites fall into the former case, and using IP proxies resolves it. You can write a dedicated crawler that scrapes publicly available proxy IPs online, verifies them, and saves them all. With a large number of proxy IPs, you can switch to a new IP every few requests, which is easy to do with requests or urllib2, so the first kind of anti-crawler is easily bypassed.

Code to crawl proxy IPs:
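A minimal sketch, assuming a hypothetical proxy-list page (PROXY_LIST_URL is a placeholder, not a real service) and using httpbin.org to verify each proxy:

import re
import requests

PROXY_LIST_URL = "https://www.example-proxy-list.com/free"  # hypothetical page
TEST_URL = "https://httpbin.org/ip"  # echoes the requesting IP

def crawl_proxies():
    """Scrape ip:port pairs from the (hypothetical) public proxy list page."""
    html = requests.get(PROXY_LIST_URL, timeout=10).text
    # Match patterns like 123.45.67.89:8080 anywhere in the page
    return re.findall(r"\d{1,3}(?:\.\d{1,3}){3}:\d{2,5}", html)

def verify(proxy):
    """Return True if the proxy answers a test request within 5 seconds."""
    proxies = {"http": "http://" + proxy, "https": "http://" + proxy}
    try:
        r = requests.get(TEST_URL, proxies=proxies, timeout=5)
        return r.ok
    except requests.RequestException:
        return False

if __name__ == "__main__":
    good = [p for p in crawl_proxies() if verify(p)]
    print("%d working proxies" % len(good))

With the verified pool saved, each call to requests.get can pass the proxies argument and rotate to a new entry every few requests.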

(2) For the second case, you can wait a random interval of a few seconds after each request before making the next one, as sketched below. On sites with logical vulnerabilities, you can also bypass the restriction that the same account cannot make the same request multiple times within a short period by requesting a few times, logging out, logging back in, and continuing to request. [Comment: anti-crawling restrictions tied to accounts are generally hard to deal with; even requests at random intervals of a few seconds may often be blocked. If you can use multiple accounts and switch between them, the effect is better.]
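A minimal sketch of the random-interval approach (the URLs are placeholders):

import random
import time
import requests

# Placeholder URLs for the pages to crawl
urls = ["https://www.example.com/page/%d" % i for i in range(1, 6)]

for url in urls:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    # Sleep a random 2 to 6 seconds so the request rhythm looks less mechanical
    time.sleep(random.uniform(2, 6))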

3. Anti-crawler on dynamic pages

The cases above mostly involve static pages. On some sites, the data we need to crawl is obtained through AJAX requests or generated by JavaScript.
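When the data comes from an AJAX request, one option is to find the underlying endpoint in the browser's developer tools (the XHR tab) and call it directly. A minimal sketch, where the endpoint, its query parameters, and the assumption that it returns JSON are all hypothetical:

import requests

# Hypothetical JSON endpoint discovered by watching XHR traffic in dev tools
api_url = "https://www.example.com/api/list"
params = {"page": 1, "size": 20}  # assumed query parameters

headers = {
    "User-Agent": "Mozilla/5.0",
    # Many AJAX endpoints also expect this header
    "X-Requested-With": "XMLHttpRequest",
}

resp = requests.get(api_url, params=params, headers=headers, timeout=10)
data = resp.json()  # assumes the endpoint returns JSON
print(data)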
