Common Website Anti-Crawler Techniques and How to Cope with Them (Repost)

Source: Internet
Author: User

  

In our 2016 big data industry forecast, "2016: Big Data Will Step Down from the Altar, Embrace Everyday Life, Win Capital's Favor, and Open Up Entrepreneurial Opportunities," we predicted that "in 2016, preventing website data from being crawled will become a business." Today I came across an article from "BSDR" that covers common anti-crawler techniques and the ways to cope with them. The text follows.

Common Anti-Crawler Techniques

Over the past few days I have been crawling a website that has put a lot of work into its anti-crawler defenses. Crawling it was somewhat difficult, and it took me some time to get around the countermeasures. Here I summarize the anti-crawler strategies I have encountered since I started writing crawlers, together with the ways I deal with them.

Functionally, a crawler is generally divided into three parts: data collection, processing, and storage. Here we only discuss the data collection part.

Websites generally implement anti-crawling from three angles: the user's request headers, user behavior, and the site's directory structure and data-loading method. The first two are relatively easy to handle, and most websites rely on them. The third applies to sites built on AJAX, which makes crawling harder.

  Anti-crawling based on request headers

Checking the headers of the user's request is the most common anti-crawler strategy. Many sites check the User-Agent header, and some also check the Referer (some resource sites use the Referer check as hotlink protection). If you run into this kind of mechanism, you can simply add the appropriate headers to the crawler: copy the browser's User-Agent into the crawler's headers, or set the Referer to the target site's domain. Header-based checks are easily bypassed by modifying or adding headers in the crawler.
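As a concrete illustration, here is a minimal sketch of this technique using the requests library. The User-Agent string and the example.com URLs are placeholders, not values from the original article.

    import requests

    # Spoof the headers the site checks: copy a real browser's User-Agent
    # and point the Referer at the target site's own domain.
    headers = {
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/58.0.3029.110 Safari/537.36"),
        "Referer": "http://www.example.com/",  # placeholder target domain
    }

    response = requests.get("http://www.example.com/some/page", headers=headers)
    print(response.status_code)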

  Anti-crawling based on user behavior

Some sites instead detect user behavior, for example the same IP visiting the same page many times within a short period, or the same account performing the same operation many times within a short period.

Most websites fall into the first case, which can be handled with IP proxies. You can write a dedicated crawler that scrapes publicly available proxy IPs, checks them, and saves the ones that work. Such a proxy-IP crawler gets used often, so it is worth keeping one of your own ready. With a large pool of proxy IPs you can switch to a new IP every few requests, which is easy to do with requests or urllib2, so the first kind of anti-crawling is easily bypassed; a rough sketch follows.
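A rough sketch of this rotation with requests. The proxy addresses and URLs are made-up placeholders; in practice the pool would come from the separately crawled and validated proxy list described above.

    import random
    import requests

    # Placeholder proxy pool; replace with proxies you have collected and tested.
    proxy_pool = [
        "http://10.10.1.10:3128",
        "http://10.10.1.11:8080",
        "http://10.10.1.12:80",
    ]

    urls = ["http://www.example.com/page/%d" % i for i in range(1, 21)]

    for i, url in enumerate(urls):
        # Pick a fresh proxy every few requests so no single IP
        # hits the site too often.
        if i % 5 == 0:
            proxy = random.choice(proxy_pool)
        proxies = {"http": proxy, "https": proxy}
        resp = requests.get(url, proxies=proxies, timeout=10)
        print(url, resp.status_code)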

For the second case, you can wait a random few seconds after each request before making the next one. On sites with certain logic flaws, you can also request a few times, log out, log back in, and continue requesting, which bypasses the restriction that the same account cannot make the same request repeatedly within a short period.
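A minimal sketch of pacing requests from the same session. The URLs are placeholders, and the 2-5 second window is an arbitrary choice rather than a value from the article.

    import random
    import time
    import requests

    session = requests.Session()

    for url in ["http://www.example.com/item/1", "http://www.example.com/item/2"]:
        resp = session.get(url)
        print(url, resp.status_code)
        # Sleep a random few seconds before the next request to mimic a human.
        time.sleep(random.uniform(2, 5))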

  Anti-crawling on dynamic pages

Most of the above applies to static pages. On some sites, however, the data we need is fetched through AJAX requests or generated by JavaScript. First, analyze the network traffic with Firebug or HttpFox. If we can find the AJAX requests and work out their parameters and the meaning of the responses, we can use the same approach as above: simulate the AJAX request directly with requests or urllib2 and parse the JSON response to get the data we need.
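A minimal sketch of calling such an AJAX endpoint directly once it has been found in the browser's network panel. The endpoint, parameter names, and response fields here are hypothetical, not taken from the original article.

    import requests

    api_url = "http://www.example.com/api/list"        # hypothetical endpoint
    params = {"page": 1, "page_size": 20}              # hypothetical parameters
    headers = {
        "User-Agent": "Mozilla/5.0",
        "X-Requested-With": "XMLHttpRequest",          # many AJAX endpoints expect this
    }

    resp = requests.get(api_url, params=params, headers=headers)
    data = resp.json()                                  # parse the JSON response
    for item in data.get("items", []):                  # hypothetical field name
        print(item)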

Being able to simulate the AJAX requests directly is ideal, but some websites encrypt all the parameters of their AJAX requests, so we simply cannot construct the request for the data we need. The site I have been crawling these days is like this: besides encrypting the AJAX parameters, it also wraps some basic functionality, everything goes through its own interfaces, and the interface parameters are encrypted as well. For a site like this we cannot use the method above. Instead I use Selenium with PhantomJS to drive a browser engine, letting PhantomJS execute the page's JavaScript to simulate a human operating the page and to trigger the page's scripts. Everything from filling in forms to clicking buttons to scrolling the page can be simulated, without worrying about the underlying requests and responses: we simply simulate, step by step, the whole process a person goes through when browsing the page to get the data.
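A minimal sketch of driving PhantomJS through Selenium to render a page whose content is generated by JavaScript. The URL is a placeholder, and PhantomJS is assumed to be installed and on the PATH.

    from selenium import webdriver

    # Launch the headless PhantomJS browser engine.
    driver = webdriver.PhantomJS()
    driver.get("http://www.example.com/dynamic/page")   # placeholder URL

    # The engine executes the page's JavaScript; we can scroll like a user
    # and then read the fully rendered HTML for parsing.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    html = driver.page_source
    print(len(html))

    driver.quit()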

This approach can bypass almost all anti-crawler measures, because instead of disguising itself as a browser to fetch data (adding headers, as above, only disguises the crawler as a browser to a certain extent), it is itself a browser: PhantomJS is simply a browser without a graphical interface, it is just not a human operating it. Selenium+PhantomJS can do many other things as well, such as recognizing touch-based (12306-style) or slider CAPTCHAs, brute-forcing page forms, and so on. It will also come in handy for automated penetration testing, which will be covered later.
