Common Anti-Crawler Techniques and Countermeasures for Websites
In our 2016 prediction article on the big data industry, "In 2016, big data will step down from the altar and embrace everyday life; capital will favor entrepreneurship", we mentioned that "in 2016, preventing website data crawling will become a business." Today I came across an article from "BSDR" that introduces common anti-crawler methods and how to counter them. The text follows.
Common Anti-Crawler Techniques
Over the past few days I have been crawling a website that has invested heavily in anti-crawler measures, and bypassing them took some effort and time. Here I summarize the anti-crawler policies and countermeasures I have encountered since I started writing crawlers.
In terms of function, a crawler is generally divided into three parts: data collection, processing, and storage. Here we only discuss the data collection part.
Websites generally implement anti-crawling in three areas: the Headers of user requests, user behavior, and the site's directory structure and data loading method. The first two are the easiest to run into, and most websites rely on them. The third is used by sites built around ajax, which makes crawling harder.
Anti-Crawling Through Headers
Checking the Headers of user requests is the most common anti-crawler policy. Many websites check the User-Agent header, and some also check the Referer (some resource sites use the Referer for anti-leeching). If you run into this kind of mechanism, simply add the required Headers to the crawler: copy the browser's User-Agent into the crawler's Headers, or set the Referer to the target site's domain. [Comment: this is often overlooked. Determine the Referer by capturing and analyzing the request, then simulate the request headers in the program.] Anti-crawling that only checks Headers can be bypassed by modifying or adding Headers in the crawler, as in the sketch below.
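A minimal sketch of this idea with the requests library, assuming a placeholder URL; the User-Agent string is just one copied from a desktop browser:

```python
import requests

# Hypothetical target URL, used only for illustration.
url = "https://example.com/page"

# Copy a real browser's User-Agent and point the Referer at the target
# site's own domain so the request looks like a normal page visit.
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/90.0.4430.93 Safari/537.36"),
    "Referer": "https://example.com/",
}

response = requests.get(url, headers=headers, timeout=10)
print(response.status_code)
```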
Anti-Crawling Based on User Behavior
Some websites detect user behavior, for example many requests to the same page from one IP address within a short time, or many operations from the same account within a short time. [Countering this kind of anti-crawling requires enough IP addresses.]
Most websites fall into the first case, which can be handled with IP proxies. You can write a crawler that scrapes public proxy IPs from the Internet, verifies them, and stores the working ones. Such a proxy-IP crawler gets used often, so it is worth keeping one ready. With a large pool of proxy IPs you can switch to a different IP every few requests, which is easy to do with requests or urllib2, and the first kind of anti-crawling is easily bypassed; see the sketch after this paragraph. [Comment: dynamic dial-up (redialing to get a new IP) is also a solution.]
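A rough sketch of rotating through a proxy pool with requests; the proxy addresses and the retry-on-failure logic are assumptions for illustration, not part of the original article:

```python
import random
import requests

# A pool of proxies harvested and validated beforehand; these addresses
# are placeholders only.
proxy_pool = [
    "http://111.111.111.111:8080",
    "http://122.122.122.122:3128",
    "http://133.133.133.133:8888",
]

def fetch(url):
    # Pick a different proxy for each request to spread traffic across IPs.
    proxy = random.choice(proxy_pool)
    proxies = {"http": proxy, "https": proxy}
    try:
        return requests.get(url, proxies=proxies, timeout=10)
    except requests.RequestException:
        # Public proxies die often; drop the bad one and retry with another.
        proxy_pool.remove(proxy)
        return fetch(url) if proxy_pool else None

response = fetch("https://example.com/data")
```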
In the second case, you can wait a random interval of a few seconds after each request before sending the next one. On websites with logic flaws, you can also log out after a few requests, log in again, and keep requesting to sidestep the limit on requests from the same account within a short time. [Comment: account-based anti-crawling is generally hard to deal with; even with random pauses of a few seconds the account may still get blocked. If you have several accounts, rotating among them works better.] A sketch of the random-delay approach follows.
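A minimal sketch of pacing requests with random delays, assuming placeholder URLs and a 2 to 6 second pause (both are arbitrary choices for illustration):

```python
import random
import time
import requests

# Placeholder list of pages to crawl.
urls = ["https://example.com/page/%d" % i for i in range(1, 6)]

session = requests.Session()
for url in urls:
    response = session.get(url, timeout=10)
    # ... parse the response here ...
    # Sleep a random few seconds so the request pattern looks less mechanical.
    time.sleep(random.uniform(2, 6))
```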
Anti-Crawling on Dynamic Pages
The situations above mostly occur on static pages. On some websites, however, the data we need is fetched through ajax requests or generated by JavaScript. First use Firebug or HttpFox to analyze the network requests. [Comment: I find the network-request analysis in Google Chrome and IE quite good too.] If we can find the ajax requests and work out the meaning of their parameters and responses, we can apply the method above: simulate the ajax requests directly with requests or urllib2 and parse the JSON responses to get the data we need, as in the sketch below.
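A sketch of calling an ajax endpoint directly, assuming the endpoint URL, the query parameters, and the "items" field of the JSON response were all worked out beforehand in the browser's network panel; every name here is a placeholder:

```python
import requests

# Hypothetical ajax endpoint and parameters discovered via network analysis.
api_url = "https://example.com/api/list"
params = {"page": 1, "page_size": 20}

headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",   # many ajax endpoints expect this
    "Referer": "https://example.com/list",
}

response = requests.get(api_url, params=params, headers=headers, timeout=10)
data = response.json()                    # the endpoint is assumed to return JSON
for item in data.get("items", []):        # "items" is an assumed field name
    print(item)
```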
Simulating the ajax requests directly is ideal, but some websites encrypt all the parameters of their ajax requests, so we have no way to construct the requests for the data we need. The site I crawled over the past few days is like this: besides encrypting the ajax parameters, it wraps some basic functionality in its own interfaces, whose parameters are encrypted too. For such a website the method above does not work. Instead I use selenium with phantomJS to drive a browser kernel: phantomJS executes the page's js, simulating a human operating the site and triggering its scripts. From filling in a form to clicking a button to scrolling the page, everything can be simulated without worrying about the specific request and response process; it simply reproduces the whole process of a person fetching data from the page. [Comment: +1 for phantomJS.] A sketch follows.
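A minimal sketch of driving PhantomJS through selenium. The URL, element names, and ids are assumptions for illustration, and the API shown matches selenium releases from around the time of this article (newer selenium versions drop PhantomJS in favor of headless Chrome or Firefox):

```python
from selenium import webdriver

# PhantomJS executes the page's JavaScript without opening a visible window;
# this assumes the phantomjs binary is on PATH.
driver = webdriver.PhantomJS()
driver.get("https://example.com/search")          # placeholder URL

# Fill a form field, click a button, then scroll, just as a user would;
# the element name and id below are hypothetical.
driver.find_element_by_name("q").send_keys("keyword")
driver.find_element_by_id("submit").click()
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Once the page's scripts have run, the rendered source contains the data.
html = driver.page_source
driver.quit()
```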
This approach can bypass almost any anti-crawler measure, because instead of pretending to be a browser (the schemes above disguise the crawler as a browser by adding Headers), it is a browser: phantomJS is simply a browser without a visible interface, and the only thing that is not human is the program controlling it. With selenium + phantomJS you can do much more, such as recognizing click-based (12306-style) or slider verification codes and brute-forcing page forms. It will also prove very useful in automated penetration testing, which will be covered later.