Crawlers and anti-crawlers

Editor's note: This article summarizes a talk given by Tsui Kuang, Hotel R&D manager at Ctrip, in the third session of "Ctrip Technology Micro-Sharing". We strongly recommend watching the video replay to see engineer Tsui demonstrate in person how, with high IQ and high EQ, to crush crawlers. Follow the Ctrip Technology Center account (Ctriptech) to get Micro-Sharing news first.

Have you ever been harassed by a crawler? Does just seeing the word "crawler" already raise your blood pressure a little? Be patient, do a little work, and you can let them win in name while actually making them suffer.

I. Why anti-crawl at all

1. Crawlers account for a high proportion of PV, which wastes money (especially the March crawlers).

What are the "March crawlers"? Every year in March, we greet a crawler peak.

At first we couldn't figure out what to do about it. Then one April we deleted a URL, a crawler kept hitting the dead URL and generating piles of errors, and QA started bothering us about them. We ended up deliberately restoring the deleted URL just for that crawler.

But then a team member objected: we can't kill the crawler, fine, but putting a URL back up specially for it is losing far too much face. So an idea emerged: the URL can stay up, but it absolutely must not serve real data.

So we served a static file. The errors stopped, but the crawler did not, which meant the other side never realized the data was fake. This was a great revelation, and it became the core of our anti-crawler approach: change.

Later, a student applied for an internship. Reading her resume, we found she had crawled Ctrip, and the interview confirmed it: she was exactly the crawler author we had "released" in April. But she was friendly and technically solid, so we made peace, and she is now close to joining full-time.

Talking with her later, she mentioned that large numbers of master's students crawl OTA data for the sentiment analysis in their theses. Thesis deadlines are in May, so, as everyone who has been through it understands: crawl the data in March, analyze it in April, write the thesis in May.

That's the rhythm.

2. Resources the company exposes for free querying get scraped in bulk, costing us competitiveness and therefore money.

OTA prices can be queried without logging in; that is the baseline. If login were required, we could make scrapers pay a price by banning their accounts, which is what many sites do. But we cannot force users to log in. So without anti-crawling, the other side can copy our information in bulk, and our competitiveness drops sharply.

If competitors can scrape our prices, over time users will learn they can simply go to the competitor instead, with no need for Ctrip. That is bad for us.

3. Do crawlers break the law? If so, can we sue for damages? That would make money.

I consulted our legal department specifically about this. The conclusion: in China this is still a gray area. A lawsuit might succeed, or it might accomplish nothing at all. So technical means have to remain the last line of defense.

II. What kinds of crawlers are out there

1. Very junior fresh graduates

The March crawlers mentioned at the beginning are the obvious example. Fresh graduates' crawlers are usually simple and crude: they ignore server load, and since their numbers are unpredictable, they can easily bring a site down.

By the way, crawling Ctrip no longer earns you an offer. As we all know, the first person to compare a beautiful woman to a flower is a genius. The second one... you know how that goes, don't you?

2. Very junior startups

There are more and more startups now. Founders who don't quite know what to build see that big data is hot, and decide to do big data.

The analysis pipeline is nearly complete when they discover they have no data at hand.

What to do? Write a crawler. So countless small crawlers, for the sake of their company's survival, scrape data relentlessly.

3. Runaway crawlers that were written wrong and that nobody stopped

On Ctrip's review pages, crawlers sometimes account for as much as 60% of visits. We chose to block them outright, yet they keep crawling tirelessly.

What does that mean? They can no longer get any data: apart from the HTTP status code being 200, everything is wrong. Yet the crawler still does not stop. These are probably small crawlers hosted on some server, long since abandoned, still diligently working away.

4. Established commercial rivals

These are the biggest opponents. They have technology, money, whatever it takes; if they want a fight to the death with you, you can only grit your teeth and fight it out.

5. Search engines having a seizure

Don't assume search engines are all good guys. They have fits too, and when one does, server performance degrades; in request volume it is no different from a network attack.

III. What counts as a crawler, and what counts as anti-crawling

Anti-crawling is, for now, a fairly new field, so some definitions we had to write ourselves. Our internal definitions are as follows:

    • Crawler: any technical means of obtaining a website's information in bulk. The key word is bulk.

    • Anti-crawler: any technical means of preventing others from obtaining your website's information in bulk. The key word, again, is bulk.

    • Collateral damage: mistakenly identifying an ordinary user as a crawler during anti-crawling. An anti-crawling strategy with a high collateral-damage rate cannot be used, no matter how effective it is.

    • Interception: successfully blocking crawler access. There is a corresponding interception rate; in general, the higher a strategy's interception rate, the higher its risk of collateral damage. A trade-off is required.

    • Resources: the sum of machine cost and labor cost.

Remember that human cost is a resource too, and a more valuable one than machines. By Moore's law, machines keep getting cheaper; by the industry's trajectory, programmers keep getting more expensive. So forcing the other side's engineers to work overtime is the winning move; machine cost is not particularly precious.

IV. Know your enemy: how to write a simple crawler

To do anti-crawling, we first need to know how a simple crawler is written.

Crawler material found on the Web is very limited, usually just a bit of Python code. Python is a fine language, but for sites that have anti-crawler measures it is not the best choice for writing crawlers.

Ironically, the Python crawler code you typically find by searching uses a Lynx user-agent. What exactly do you expect to accomplish with that user-agent?

Writing a crawler typically involves a few steps:

    • Analyze the page's request format

    • Construct the appropriate HTTP request

    • Send HTTP requests in bulk and collect the data
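The three steps above can be sketched in a few lines of Python using only the standard library. The endpoint, parameter names, and headers below are invented for illustration; a real target's request format has to be read out of the browser's network panel first.

```python
import urllib.parse
import urllib.request

def build_request(hotel_id, checkin):
    """Steps 1+2: reproduce the page's request format as an HTTP request."""
    params = urllib.parse.urlencode({"hotelId": hotel_id, "checkin": checkin})
    url = f"https://example.com/api/price?{params}"   # hypothetical endpoint
    return urllib.request.Request(
        url,
        headers={  # mimic a real browser instead of the default Python UA
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
            "Accept": "application/json",
        },
    )

def fetch_all(hotel_ids, checkin):
    """Step 3: send requests in bulk (sequentially here; real crawlers add
    concurrency, delays, and error handling)."""
    for hid in hotel_ids:
        req = build_request(hid, checkin)
        # with urllib.request.urlopen(req) as resp:
        #     yield resp.read()       # actual network call left commented out
        yield req.full_url
```

The network call is commented out so the sketch stays self-contained; everything else is the shape a real crawler would take.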

For example, look directly at a Ctrip production URL. On the detail page, click the "OK" button and the price loads. Assuming the price is what you want: among all the network requests, which one carries the result?

The answer is surprisingly simple: just sort the network requests by the amount of data transferred, in descending order. The other, deliberately confusing URLs are more complex, and no developer is willing to pad them with bulky data.

V. Know your enemy: how to write an advanced crawler

So how does a crawler level up? The usual advanced directions are the following:

Distributed

Textbooks often tell you that, for crawling efficiency, you need to deploy the crawler across multiple machines. That's all a lie. The only real function of distribution is to defeat IP bans. Banning IPs is the ultimate blunt weapon: it works very well, and of course ordinary users caught in the ban find it just delightful.
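This is why "distributed" crawlers really just rotate source addresses so that no single IP crosses the target's threshold. A minimal round-robin pool, with placeholder proxy addresses:

```python
import itertools

class ProxyPool:
    """Cycle through a fixed set of proxies, one per outgoing request."""
    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._cycle)

pool = ProxyPool(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
# Each request goes out through the next proxy in turn:
used = [pool.next_proxy() for _ in range(6)]
```

Real pools also evict proxies that get banned and weight by latency; the round-robin core is the same.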

Simulating JavaScript

Some tutorials present simulating JavaScript and crawling dynamic pages as an advanced technique. In fact it is just a basic skill. If the target has no anti-crawler measures, you can hit the AJAX endpoint directly without caring how the JS processes it. If the target does have anti-crawler measures, the JavaScript is bound to be very complex, and the point is careful analysis, not simple simulation.

In other words: this should count as a basic skill.

PhantomJS

This is the extreme case. PhantomJS was built for automated testing; it worked so well that many people took it up for crawling. But it has one fatal flaw: efficiency. Besides, PhantomJS can still be detected, for reasons we will set aside here.
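The talk does not disclose how PhantomJS is actually caught, but the simplest signal is an assumption worth showing: unless the author overrides it, its default User-Agent announces itself. A naive server-side check might look like this (real detection layers on many stronger signals):

```python
# Marker substrings that naive headless clients leave in their User-Agent.
HEADLESS_MARKERS = ("PhantomJS", "HeadlessChrome")

def looks_headless(user_agent):
    """Flag a request whose UA admits to being a headless browser."""
    return any(marker in user_agent for marker in HEADLESS_MARKERS)
```

A serious crawler will of course spoof the UA, which is exactly why this can only ever be the first and weakest filter.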

VI. Pros and cons of crawlers at different levels

The more primitive the crawler, the easier it is to block, but the better its performance and the lower its cost. The more advanced the crawler, the harder it is to block, but the worse its performance and the higher its cost.

When a crawler's cost climbs high enough, we no longer need to block it. Economics has a term for this, the marginal effect: past a certain point, extra cost yields little extra return.

If we then compare the resources on both sides, we find that an unconditional fight to the death is not cost-effective. There should be a break-even point; beyond it, just let them crawl. After all, we do anti-crawling for business reasons, not for face.

VII. How to design an anti-crawler system (general architecture)

A friend once proposed an architecture like this to me:

1. Preprocess the request to make identification easier;
2. Identify whether the requester is a crawler;
3. Handle the request appropriately according to the result.

At the time I thought it sounded reasonable, it was indeed an architecture, and the idea differed from ours. Only afterwards did we realize what was wrong with it. Because:

If you can identify a crawler, why all the fuss? Just handle it however you like. And if you cannot identify the crawler, whom exactly are you "handling appropriately"?

Of the three sentences, two are filler and only one is useful, and even that one gives no concrete implementation. So what use is this architecture (this division of steps)?

Because there is currently something of an architect cult, many small startups recruit developers under the title of architect. The posting reads "junior architect", yet architect is itself a senior title, so why would there be a junior one? It is the equivalent of "junior general" or "junior commander-in-chief".

Then you join the company and find ten people: one CTO and nine architects, and you may well be the only junior architect while everyone else is senior. And the junior architect is not even the worst trap: some small startups recruit a CTO to do plain development.

Traditional anti-crawler techniques

1. Keep access statistics in the backend; if a single IP exceeds a threshold, block it.

The effect is good, but there are two defects. One, it very easily hurts ordinary users. Two, IPs are simply not valuable: a few tens of yuan can buy hundreds of thousands of them. So overall it is rather weak, though against the March crawlers it is still very useful.
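A minimal sketch of this per-IP counter, assuming a sliding time window (the limit and window size are illustrative numbers, not Ctrip's actual thresholds):

```python
import time
from collections import defaultdict, deque

class IpThrottle:
    """Block an IP once it exceeds `limit` requests inside `window` seconds."""
    def __init__(self, limit=100, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)   # ip -> timestamps of recent hits

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        while q and now - q[0] > self.window:   # drop hits outside the window
            q.popleft()
        if len(q) >= self.limit:
            return False                         # over threshold: block
        q.append(now)
        return True
```

Because old timestamps age out, a blocked IP recovers once it slows down, which softens (but does not remove) the collateral-damage problem for users behind shared IPs.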

2. Keep access statistics in the backend; if a single session exceeds a threshold, block it.

This looks a bit more advanced but actually works worse, because a session is completely worthless: the crawler just requests a new one.

3. Keep access statistics in the backend; if a single user-agent exceeds a threshold, block it.

This is a blunt weapon, something like broad-spectrum antibiotics: surprisingly effective, but the collateral damage is severe. Use it very carefully. To date we have only ever briefly blocked Firefox on the Mac.

4. Combinations of the above

Combined, their coverage grows and the collateral-damage rate drops; against low-level crawlers they are still fairly useful.

From the above you can see that crawling versus anti-crawling is a game, and the pay-to-win player has the upper hand. Since the methods above are mediocre at best, relying on JavaScript is more dependable.

Some will say: JavaScript? Can't the crawler just skip the front-end logic and hit the service directly? How is that reliable? Well, I'm guilty of a clickbait heading: JavaScript is not only the front end. Skipping the front end does not mean skipping JavaScript. In other words: our server is built on Node.js.

Study question: when we write code, what kind of code do we dread encountering? What kind of code is hard to debug?

Eval

eval is notorious for its inefficiency and poor readability. That is exactly what we need.

Goto

JavaScript has no real goto support, so you need to implement goto yourself.

Obfuscation

Current minify tools usually just rename identifiers to a, b, c, d, which does not meet our requirements. We can rename them to something better, such as Arabic letters. Why? Because Arabic is written right to left, so when it is mixed into left-to-right code, the visual order on screen diverges from the logical order in the file. Unless the other side hires an Arabic-speaking programmer, the analysis is a splitting headache.
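A toy sketch of this renaming idea. Real obfuscators operate on the parsed AST; a whole-word regex substitution is enough to show the effect, and the identifier names here are invented:

```python
import re

# A pool of Arabic letters to rename identifiers into.
ARABIC = "ابتثجحخدذرزسشصضطظ"

def arabic_minify(source, identifiers):
    """Rename each listed identifier to a distinct Arabic letter, so the
    minified source mixes right-to-left text into left-to-right code."""
    mapping = {name: ARABIC[i] for i, name in enumerate(identifiers)}
    for name, short in mapping.items():
        source = re.sub(rf"\b{name}\b", short, source)
    return source

js = "var price = base + tax; return price;"
print(arabic_minify(js, ["price", "base", "tax"]))
```

Even in this tiny example, an editor's bidirectional text rendering makes the output genuinely disorienting to read.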

Unstable code

Which bugs are hardest to fix? The ones that are hard to reproduce. So our code is riddled with deliberate uncertainty and is different every time it is served.

Code Demo

The demo code itself is easier to understand by reading it directly. Here is a brief overview of the ideas:

    1. Pure JavaScript anti-crawler demo: change the endpoint address, so the other side crawls a wrong price. Simple, but easy to spot if the other side inspects it deliberately.

    2. Pure JavaScript anti-crawler demo: change the key. Also simple, and harder to detect; it can make the other side crawl wrong prices without noticing.

    3. Pure JavaScript anti-crawler demo: change the key dynamically. This drives the cost of rotating the key down to zero, so the overall cost is lower.

    4. Pure JavaScript anti-crawler demo: change the key in a very complex way. This makes analysis genuinely difficult for the other side; combined with the browser detection mentioned later, it becomes much harder to crawl.
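Ideas 2 and 3 can be sketched as follows. Assume a hypothetical price endpoint that requires a key derived from a server secret and the current time bucket: the (obfuscated) JS shipped to browsers computes the matching key, while a crawler replaying a captured key silently gets poisoned data once the bucket rolls over. All names, values, and the 5-minute bucket are illustrative, not Ctrip's actual scheme.

```python
import hashlib

SECRET = "server-side-secret"    # hypothetical; never shipped to the client
BUCKET_SECONDS = 300             # key rotates every 5 minutes

def current_key(now):
    """Derive the key that is valid for the time bucket containing `now`."""
    bucket = int(now // BUCKET_SECONDS)
    return hashlib.sha256(f"{SECRET}:{bucket}".encode()).hexdigest()[:16]

def serve_price(requested_key, now):
    """Wrong or stale key: serve a fake price instead of an error, so the
    crawler never learns it has been caught."""
    if requested_key != current_key(now):
        return {"price": 9999}   # poisoned data
    return {"price": 388}        # the real price
```

The crucial design choice is the poisoning branch: an HTTP error would tell the crawler author the key expired, while a plausible wrong price keeps them happily collecting garbage.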

End

We mentioned the marginal effect: this is the point where we can stop. Investing further manpower is not worth the candle, unless some particular opponent insists on a fight to the death with you. But then you are fighting for dignity, not for business.

Browser detection

For different browsers, our detection method is not the same.

    • IE detect bugs;

    • The degree of rigor of the FF detection standards;

    • Chrome detects powerful features.

VIII. Got you. Now what?

    • If blocking raises no production incidents: intercept directly.

    • If blocking might raise production incidents: release fake data instead (also known as poisoning).
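The two branches above can be sketched as a simple dispatch. The offset-based fake price is invented for illustration; a real system would make the poisoned data far more plausible:

```python
import random

def handle(is_crawler, blocking_is_safe, real_price):
    """Return (status, price) for a request, per the two branches above."""
    if not is_crawler:
        return 200, real_price
    if blocking_is_safe:
        return 403, None                        # direct interception
    # Blocking might trigger incidents (remember the April crawler whose
    # errors paged QA), so poison instead: plausible but wrong.
    fake = real_price + random.randint(1, 50)
    return 200, fake
```

Either way the crawler gets nothing real; the only question is whether it is allowed to know that.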

There are also some more divergent ideas. For example, could the response itself attempt SQL injection against the crawler's database? After all, the other side moved first. But legal has not given a definite answer, and it would be hard to justify, so for now it remains only a thought experiment.

1. Technical suppression

As DotA players know, the AI has a -de command: when an AI hero is killed, it gains a multiplier on experience. Kill the AI too much early on, and it comes back overpowered and impossible to kill.

The right play is to suppress the opponent's level, not to kill them outright. Anti-crawling is the same: don't go too hard at the start and force the other side into a fight to the death.

2. Psychological warfare

Taunt them, pity them, mock them, be shameless.

We won't elaborate; just grasp the spirit.

3. Going easy on them

This may be the highest state.

Programmers' lives aren't easy, and crawler authors' even less so. Take pity on them and leave them a little something to eat. After all, once you've done anti-crawling well, in a few days you may be reassigned to writing crawlers yourself.

For example, a while ago someone asked me whether I could write a crawler... a kind person like me, how could I possibly say no?
