How should I evaluate a crawler design that automatically rotates the User-Agent?

Source: Internet
Author: User
I wrote a crawler that scraped a pile of data from a website, and I made it rotate the User-Agent automatically. I felt pretty good about it, but I am afraid of being blocked by the target site. Please give me some better strategies ~

Reply content:

Having a crawler automatically swap the User-Agent takes only one line of code, so it is not a difficult technical feat. The reason crawlers use different User-Agents is simply to impersonate browsers and make it harder for the server to recognize them as crawlers.
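To illustrate that "one line", here is a minimal Python sketch using the requests library; the URL and the U-A string are placeholders, not from the original post:

```python
import requests

# The whole trick is this one header: present a browser U-A instead of the
# library's default identifier (e.g. "python-requests/x.y.z").
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
resp = requests.get("https://example.com/page", headers=headers)
print(resp.status_code)
```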

Before crawling, you should first look at the site's robots.txt file to see what the site's developers (or owners) allow to be crawled and what they do not, and then try not to step outside what the owner permits. This is a matter of courtesy.

Of course, in most cases we still want to crawl the content we need regardless of what the site owner says, and that means dealing with some anti-crawler mechanisms. The User-Agent trick the asker mentions is indeed one such method, but against most websites' anti-crawler mechanisms, working on the U-A alone is not enough.


Below are several simple and common anti-crawler mechanisms and their workarounds.

The simplest case is a site that does nothing at all. Such a website is trivially easy to crawl.

A small step up: the site may check the User-Agent. The U-A's job is to identify what browser/client you are. The network libraries built into most programming languages, as well as third-party network libraries, set the U-A to their own identifier, so site developers can easily intercept such requests. That is why the asker changes the U-A: swapping the library's default U-A for a common browser's U-A interferes with the server's ability to identify the client.
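A minimal sketch of the rotation idea in Python (the U-A pool and URL are illustrative placeholders, not a vetted list):

```python
import random
import requests

# A small pool of common browser User-Agent strings (placeholders; in
# practice you would collect current, real browser U-As).
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0",
]

def fetch(url):
    # Pick a random U-A per request so successive requests do not all
    # carry the same client fingerprint.
    headers = {"User-Agent": random.choice(UA_POOL)}
    return requests.get(url, headers=headers, timeout=10)

resp = fetch("https://example.com/")
print(resp.status_code)
```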

A little more complicated: the site may identify visitors not only by browser identity but also by IP address. One problem with this anti-crawler mechanism is that if many real requests come from the same IP address, it is easy to misjudge them as a crawler; in general, the server identifies each user by a combination of IP address and cookies. To crawl such a website, a simple and effective solution is to switch proxies. Many websites publish free proxies; scrape them and then send requests through a proxy (preferably a high-anonymity proxy), and you can handle most sites. Note that free proxies generally have a short lifespan, so after scraping them you should also have a mechanism to verify that the proxies you hold are still valid.
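A rough sketch of both halves in Python with requests: fetching through a proxy, plus a liveness check to weed out dead proxies. The proxy addresses and test URL below are placeholders:

```python
import requests

def is_alive(proxy, test_url="https://httpbin.org/ip", timeout=5):
    """Check whether a proxy still works by fetching a known page through it."""
    proxies = {"http": proxy, "https": proxy}
    try:
        return requests.get(test_url, proxies=proxies, timeout=timeout).ok
    except requests.RequestException:
        return False

# Placeholder list; in practice these come from a free-proxy site you scraped.
scraped_proxies = ["http://10.0.0.1:8080", "http://10.0.0.2:3128"]

# Keep only proxies that still respond; free proxies die quickly, so this
# check should be re-run periodically.
live_proxies = [p for p in scraped_proxies if is_alive(p)]

if live_proxies:
    proxy = live_proxies[0]
    resp = requests.get("https://example.com/data",
                        proxies={"http": proxy, "https": proxy}, timeout=10)
    print(resp.status_code)
```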

More complex websites may be restricted by accounts. On this type of site you generally cannot see the information you need until you register an account and log in. For such a website, you need to register a batch of accounts and switch to a fresh one whenever an account becomes unavailable. You can write a script to register accounts in bulk; of course, registration usually comes with a CAPTCHA. For a simple CAPTCHA you can use OCR (search Google for options); for a complicated one, either solve it manually or buy a human captcha-solving service.
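A hedged sketch of the account-rotation part in Python; the login URL, form fields, and the ban signal are all hypothetical and depend entirely on the target site:

```python
import requests

# Hypothetical account pool, registered in advance (e.g. by a batch script).
ACCOUNTS = [("alice", "pw1"), ("bob", "pw2"), ("carol", "pw3")]

def login(username, password):
    """Log in and return an authenticated session.

    The URL and form field names are placeholders; a real site will differ
    and may require extra hidden parameters or a CAPTCHA."""
    session = requests.Session()
    session.post("https://example.com/login",
                 data={"username": username, "password": password})
    return session

def fetch_with_rotation(url):
    # Try each account in turn; move on when one appears banned.
    for username, password in ACCOUNTS:
        session = login(username, password)
        resp = session.get(url, timeout=10)
        # Hypothetical ban signal; real sites use varied status codes/pages.
        if resp.status_code != 403:
            return resp
    raise RuntimeError("all accounts exhausted")
```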

Even more complex: some social websites. Most social sites are very closed; take Sina Weibo as an example. Logging in to Weibo requires submitting many parameters, and the page you get after login is not the final source code; you still have to parse it yourself. For pages rendered heavily with JS, you can either parse everything yourself, which is generally complicated but straightforward (yes, those two words are not contradictory), or, when parsing it yourself is not feasible, just use a library such as Python's selenium.
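For the JS-rendered case, a minimal selenium sketch (this assumes a local Chrome install; the URL is a placeholder):

```python
from selenium import webdriver

# Let a real browser execute the page's JavaScript, then read the rendered
# DOM instead of the raw initial source.
driver = webdriver.Chrome()
try:
    driver.get("https://example.com/js-heavy-page")
    html = driver.page_source  # DOM after JS rendering
    print(len(html))
finally:
    driver.quit()
```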

........

There is a lot more in between, but let me end with one last situation, which is probably an unsolved problem in the crawler field: the site requires payment for every resource request. Yes, on such websites you have to top up an account and pay for each request. They don't care whether you are a machine or a person; they charge you either way.



That's about it. I wrote this in a hurry, so the summary is full of holes. Sorry...

What is the point of the UA? It's not what recognizes you. It's like a supermarket giveaway where each person may take only one gift: you change your clothes and line up again... not only do the staff still recognize you, they also find you funny. I feel the OP's anti-blocking strategy is just too low-level...
I used this strategy myself before graduation. It is useless against some large websites; I got detected and banned within minutes.
Changing IP addresses frequently can also help. After all, big sites generally won't block a whole address range, though some crazy ones will block an entire class-A or class-B segment, leaving the whole range with no data.
Also, the data you fetch may have been cached by XX broadband. You may have seen an earlier question about the hardest bug to debug: XX broadband caches the dynamic data of the target URL, which is painful because the data you get may be useless, having expired long ago.
You'd better think of something else and fool the server from other angles. It also helps to look at how other crawlers do it.
You can write this kind of thing yourself; there are plenty of projects on GitHub you can search for. At the very least, change your IP address...

Isn't automatically changing the User-Agent standard practice by now... Just today I noticed a crawler hitting us while constantly mutating its UA, and I blocked it by IP anyway.

For what you're asking: change the IP, change machines, imitate real user behavior, and so on. Rotating proxy IPs beats the UA trick; once you change that disguise, the server no longer recognizes you.
