Crawlers vs. Anti-Crawler Measures (repost)

Source: Zhihu
Author: xlzd

Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please cite the source.
Link: http://www.zhihu.com/question/34980963/answer/60627797

Changing a crawler's User-Agent takes only a single line of code; it is not difficult technical work. The reason a crawler changes its User-Agent is simply to imitate a browser, so that the server cannot easily identify it as a crawler.

For crawlers, the politest policy is actually to read the site's "robots.txt" file before crawling, see what the site's developers (or owners) allow and disallow you to fetch, and then respect those wishes rather than doing anything out of line. That is only decent.
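Python's standard library ships a parser for exactly this check, `urllib.robotparser`. A minimal sketch follows; the rules string is a made-up example rather than any real site's robots.txt (in practice you would point the parser at `https://the-site/robots.txt` with `set_url()` and `read()`):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
rules = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Ask whether our crawler may fetch a given URL before requesting it.
print(rp.can_fetch("MyCrawler", "https://example.com/private/page"))  # False
print(rp.can_fetch("MyCrawler", "https://example.com/public/page"))   # True
```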

Of course, in most cases we still want to fetch the content we need regardless of whether the site owner agrees, and that requires some anti-anti-crawler techniques. Changing the User-Agent, as the asker mentioned, is one such method, but for most sites with anti-crawler measures, changing the User-Agent alone is not enough.


Below are a few simple and common anti-crawler mechanisms and how to deal with them.

The simplest case is a site that does nothing at all; such sites are trivial to crawl, so no more needs to be said.

Slightly stricter sites may check the User-Agent, whose role is to identify which browser or client you are. The built-in and third-party network libraries of common programming languages set the User-Agent to their own identity by default, so in that case site developers can easily block them. Hence the asker's suggestion of changing the User-Agent: replacing the library's default with the User-Agent of a common browser interferes with the server's ability to recognize the crawler.
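With Python's standard library, overriding the default User-Agent really is a one-line affair: pass a `headers` dict when building the request. The UA string below is just an example of a common browser identity, and the URL is a placeholder:

```python
import urllib.request

# A typical desktop-browser User-Agent string (example value).
ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")

# Without the headers argument, urllib would announce itself as
# "Python-urllib/3.x", which is trivial for a server to block.
req = urllib.request.Request("https://example.com/", headers={"User-Agent": ua})
print(req.get_header("User-agent"))
```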

More complex sites may identify visitors not only by client identity but also by IP address. One problem with this mechanism is that many real requests coming from the same IP are easily misjudged as a crawler, so servers generally identify each user by IP plus cookies. To crawl such a site, a simple and effective solution is to use proxies. Many websites offer free proxies: scrape them, then send requests through a proxy (preferably a high-anonymity one), and that handles most sites. Note that a proxy's lifetime is generally short, so after collecting proxies you should have a mechanism that regularly verifies which of them still work.
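Such a proxy pool can be sketched as follows: rotate through the collected proxies and drop the ones that stop working. The proxy addresses here are placeholders, and the removal would in practice be triggered by a failed test request through that proxy:

```python
class ProxyPool:
    """Round-robin over collected proxies, pruning ones that die."""

    def __init__(self, proxies):
        self.proxies = list(proxies)
        self._i = 0

    def next_proxy(self):
        if not self.proxies:
            raise RuntimeError("no live proxies left; scrape some more")
        proxy = self.proxies[self._i % len(self.proxies)]
        self._i += 1
        return proxy

    def mark_dead(self, proxy):
        # Free proxies expire quickly, so prune on failure.
        if proxy in self.proxies:
            self.proxies.remove(proxy)

# Placeholder addresses; a real pool would be scraped from a proxy-list site.
pool = ProxyPool(["203.0.113.1:8080", "203.0.113.2:8080", "203.0.113.3:8080"])
p = pool.next_proxy()
pool.mark_dead(p)          # e.g. after a timeout through this proxy
print(len(pool.proxies))   # 2
```

Each request would then be routed through the current proxy, for example via `urllib.request.ProxyHandler({"http": pool.next_proxy()})`, with a periodic job re-validating the whole list.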

Even stricter sites may limit access by account. Such websites generally do not expose all the needed information before registration; you must register and log in to see it. Crawling such a site requires registering a number of accounts and switching to a new one once an account becomes unusable. You can write a script to register accounts in bulk. Registration generally comes with a CAPTCHA: for simple CAPTCHAs you can google OCR approaches; for complex ones, either write them off or buy a human CAPTCHA-solving service.
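The account-switching part can be sketched as a small rotation helper. Everything below is hypothetical: `try_login` stands in for whatever site-specific login routine you write, returning a session on success and `None` for a banned account:

```python
def first_working_session(accounts, try_login):
    """Try accounts in order until one logs in; return (account, session)."""
    for account in accounts:
        session = try_login(account)
        if session is not None:
            return account, session
    raise RuntimeError("all accounts exhausted; register a new batch")

# Demo with a fake login: pretend the first two accounts are banned.
banned = {"user1", "user2"}
accounts = ["user1", "user2", "user3"]
account, session = first_working_session(
    accounts, lambda a: None if a in banned else f"session-for-{a}")
print(account)   # user3
```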

More complicated still are social networking sites, most of which are very closed. Take Sina Weibo as an example: logging in requires submitting many parameters, and the pages returned after login are not the final source, so you still need to parse them yourself. For heavily JavaScript-rendered pages, the data is generally complex but still easy to parse out (yes, those two words are not contradictory); for cases that are complex and genuinely hard to parse, you can go through a few libraries (such as Python's selenium).
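For the heavily JavaScript-rendered case, a minimal selenium sketch might look like this (assuming selenium and a matching ChromeDriver are installed; the URL and CSS selector are placeholders):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run without a visible browser window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/js-rendered-page")  # placeholder URL
    # page_source now holds the DOM *after* JavaScript has run, unlike
    # the raw HTTP response a plain network library would receive.
    html = driver.page_source
    items = driver.find_elements(By.CSS_SELECTOR, ".item")  # placeholder selector
finally:
    driver.quit()
```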

........

Many cases in between are not summarized here, which leaves one final situation. This is probably the ultimate puzzle of the crawler world: the site simply makes every resource require payment. Yes, with such a website you recharge your account and every request costs money; they do not care whether you are a machine or a human, they charge either way.



That is about it. Time was tight and this summary has many gaps; please forgive me.
