How should a crawler design that automatically rotates its User-Agent be evaluated?

Source: Internet
Author: User
I wrote a crawler that scraped a bunch of data from a site while automatically rotating the User-Agent, and it felt great.

But I'm afraid of getting blocked by the target website.
Could the experts suggest some better strategies?

Reply content:

Automatically rotating the User-Agent takes only a single line of code, so it is not technically difficult work. The reason a crawler switches between different User-Agents is simply to imitate a browser, so that the server cannot easily identify it as a crawler.

For crawlers, the most polite policy is actually to read the site's robots.txt file before crawling, see what the site's developers (or owners) allow you to crawl and what they don't, and then respect their wishes and not overstep. That is the considerate way to do it.

Of course, in practice, most of the time we still want to fetch the content we need regardless of whether the site owner agrees, and that requires some anti-anti-crawler techniques. Changing the User-Agent, as the original poster mentions, is indeed one such method, but for most sites with anti-crawler measures, changing the User-Agent alone is not enough.


Here are a few simple and common anti-crawler mechanisms and how to deal with them.

The simplest case is a site that does nothing at all; such sites are trivial to crawl, so I won't elaborate.

Going one step further, the site may check the User-Agent, whose role is to identify which browser or client you are. The built-in or third-party networking libraries of common programming languages set the User-Agent to their own identity, so site developers can easily block them. Hence the original poster's trick: replace the library's default User-Agent with that of a common browser, in order to interfere with the server's identification.
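As a rough illustration of that trick, here is a minimal standard-library sketch; the User-Agent strings in the pool are illustrative examples, not authoritative values.

```python
import random
import urllib.request

# A small pool of common desktop-browser User-Agent strings.
# (Illustrative examples; in practice, copy current ones from a real browser.)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def build_request(url: str) -> urllib.request.Request:
    """Return a Request whose User-Agent is a randomly chosen browser
    string instead of urllib's default 'Python-urllib/x.y' identity."""
    return urllib.request.Request(
        url, headers={"User-Agent": random.choice(USER_AGENTS)}
    )

req = build_request("https://example.com/")
# urllib normalizes header keys to capitalized form: "User-agent".
assert req.get_header("User-agent") in USER_AGENTS
```

The same idea applies to any HTTP library: just override the default `User-Agent` header per request.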

More complicated sites may identify visitors not only by browser identity but also by IP address. One problem with this anti-crawler mechanism is that if many genuine requests come from the same IP, they can easily be misjudged as a crawler, so servers generally identify each user by IP plus cookie. To crawl such a site, a simpler and more effective solution is to use proxies. Many websites offer free proxies: scrape them, then send requests through a proxy (preferably a high-anonymity one), and that handles most sites. Note that a proxy's useful lifetime is generally short, so after collecting proxies the crawler should have a mechanism that periodically verifies which of the existing proxies still work.
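The "keep a pool of proxies and retire the dead ones" idea can be sketched as follows; the proxy addresses are made-up placeholders, and the actual liveness check (a timed test request through each proxy) is left out, since it depends on the target site.

```python
from dataclasses import dataclass, field

@dataclass
class ProxyPool:
    """Track a scraped list of proxies and skip ones known to be dead.
    Proxy addresses below are placeholders, not real servers."""
    proxies: list
    dead: set = field(default_factory=set)

    def get(self) -> str:
        """Return the first proxy not yet marked dead."""
        for proxy in self.proxies:
            if proxy not in self.dead:
                return proxy
        raise RuntimeError("no live proxies left; scrape a fresh batch")

    def mark_dead(self, proxy: str) -> None:
        """Call this when a request through `proxy` times out or is refused."""
        self.dead.add(proxy)

pool = ProxyPool(["203.0.113.1:8080", "198.51.100.2:3128"])
first = pool.get()        # "203.0.113.1:8080"
pool.mark_dead(first)     # pretend it stopped responding
assert pool.get() == "198.51.100.2:3128"
```

In a real crawler, `mark_dead` would be triggered by connection errors, and a background task would re-test and re-add proxies, since free proxies often come back to life.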

Even more complex sites may restrict access by account. On such websites you generally cannot see all the required information without an account; you must register and log in first. Crawling them requires registering a batch of accounts and switching to a new one once an account becomes unusable. You can write a script to register accounts in bulk. Registration usually comes with a CAPTCHA: for simple CAPTCHAs you can google OCR material, while for complex ones you'll either have to give up or buy a human CAPTCHA-solving service.
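The account-rotation part can be sketched like this; the credentials are made up, and the login and ban-detection logic (which is entirely site-specific) is reduced to a single `retire_current` call.

```python
from collections import deque

class AccountPool:
    """Cycle through pre-registered accounts, dropping banned ones.
    The (username, password) pairs here are fabricated for illustration."""

    def __init__(self, accounts):
        self._queue = deque(accounts)

    def current(self):
        """The account the crawler should currently log in with."""
        if not self._queue:
            raise RuntimeError("all accounts banned; register more")
        return self._queue[0]

    def retire_current(self):
        """Drop the active account once the site bans or locks it."""
        self._queue.popleft()

pool = AccountPool([("user1", "pw1"), ("user2", "pw2")])
assert pool.current() == ("user1", "pw1")
pool.retire_current()                      # site banned user1
assert pool.current() == ("user2", "pw2")
```

The crawler would call `retire_current` whenever a logged-in request starts returning ban pages, then re-authenticate with the next account.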

More complicated still are the social networking sites, most of which are very closed. Take Sina Weibo as an example: logging in requires submitting many parameters, and even after logging in, the page returned is not the final content; you still have to parse it yourself. For heavily JavaScript-rendered pages, the work is generally complicated but feasible (yes, those two words don't contradict each other): either you parse the embedded data out yourself, or, when that is too hard, you drive the page with a few libraries (such as Python's selenium).
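The "parse it yourself" route often means pulling out the JSON blob that the page's scripts embed, instead of rendering the JavaScript. A small sketch of that idea; the sample HTML and the `window.__DATA__` variable name are fabricated for illustration, and real sites each use their own layout.

```python
import json
import re

# Many JS-heavy pages ship their data as a JSON blob inside a <script>
# tag rather than as rendered HTML. Fabricated sample for illustration:
SAMPLE_HTML = """
<html><body>
<script>window.__DATA__ = {"posts": [{"id": 1, "text": "hello"}]};</script>
</body></html>
"""

def extract_embedded_json(html: str) -> dict:
    """Pull the JSON assigned to window.__DATA__ out of the raw HTML."""
    match = re.search(r"window\.__DATA__\s*=\s*(\{.*?\});", html, re.S)
    if match is None:
        raise ValueError("data blob not found; page layout may have changed")
    return json.loads(match.group(1))

data = extract_embedded_json(SAMPLE_HTML)
assert data["posts"][0]["text"] == "hello"
```

When no such blob exists and the data only appears after scripts run, that is where a browser-automation library like selenium earns its keep.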

........

Many cases in between are not summarized here; finally, there is one more situation, probably the hardest puzzle in the crawler world: sites where all resources require payment. Yes, on such websites you have to top up your account, and every request costs money; they don't care whether you are a machine or a person, they charge either way.



That's roughly it. Time was tight and this summary has plenty of gaps, please forgive me...

What's the use of changing the UA? It doesn't stop the server from identifying you. It's like grabbing a one-per-person freebie at a supermarket, then changing clothes and going back for another: people not only recognize you, they also find you funny. The original poster's anti-blocking strategy is far too naive...
I used this anti-blocking strategy on some big sites back before I graduated, and it's useless: you get caught and banned within minutes.
Frequently changing IPs may help a little, since sites generally won't ban large IP blocks wholesale, though a few crazy ones will ban an entire class-A or class-B range and take out a whole swath with it.
Also, the data you pull down may be your ISP's cache. I saw a question before about the hardest bugs to debug, and one went like this: a certain broadband provider was caching the target site's dynamic data by URL, which was painfully annoying. So the data you fetch may be stale and useless.
You'd better think of something else and fool the server from other angles; it's also worth looking at how other people do their crawling.
There are plenty of crawlers like this on GitHub; search for them and study them yourself. At the very least rotate your IP; auto-changing the User-Agent alone is not up to standard....

Just today I found a crawler constantly changing its UA while crawling us, so I banned its IP. Feels fitting, given the question you're asking.

Changing the IP, changing the computer, high-fidelity browser simulation, and so on: compared with those, this strategy is nothing, and proxy IPs alone are already better. Do you really think the server can't recognize you anymore? Relying on the UA alone is boring and silly.