An analysis of anti-crawler strategies for websites

Source: Internet
Author: User


With the popularity of search engines, the web crawler has become a very common network technology. Besides Google, Yahoo, Microsoft and Baidu, which focus on search, almost every large portal site runs its own crawler; there are dozens of well-known ones and thousands upon thousands of obscure ones. For a content-driven website, being visited by web crawlers is unavoidable.

Some well-behaved search engine crawlers use a reasonable crawl frequency and consume few site resources, but many badly written crawlers crawl very poorly, often issuing tens or hundreds of concurrent requests and fetching the same pages over and over. For small and medium-sized sites this kind of crawler is often a devastating blow, and crawlers written by programmers with little experience are especially destructive. At one point I found in Javaeye's logs that a user's Java crawler had issued nearly one million dynamic requests in a single day. It was a simple crawler written with the JDK standard class library, and because Javaeye's internal links form cycles, the program fell into an infinite loop. For a site of Javaeye's scale, around a million page views per day, the access pressure from such a crawler is very high, making the site slow or even inaccessible.

In addition, a considerable number of web crawlers exist to steal the content of the target site. For example, the Javaeye site has had its forum posts crawled by two competing websites, which then used robots to repost them in their own forums. Such crawlers not only slow down web access but also violate the site's copyright.

A site rich in original content, with a reasonable and easy-to-crawl URL structure, is a feast for crawlers of every kind. For many websites, the traffic brought by crawlers far exceeds that of real users; crawler traffic can even be an order of magnitude higher than real traffic. Even a site like Javaeye, which has a fairly strict anti-crawler strategy, still processes twice as many dynamic requests from crawlers as from real users. It is safe to say that at least two thirds of today's Internet traffic is generated by crawlers. Anti-crawler measures are therefore a problem worth exploring and solving over the long term.

First, manually identify and block crawler access

Quite a few crawlers put a very high load on the site, so it is easy to identify their source IP addresses. The easiest way is to check the connections on port 80 with netstat:

netstat -nt | grep youhostip:80 | awk '{print $5}' | awk -F":" '{print $1}' | sort | uniq -c | sort -r -n

This one-line shell command sorts source IPs by the number of connections to port 80, so you can tell at a glance which ones are crawlers. In general, a crawler's concurrent connection count is very high.

If you use lighttpd as the web server, it is even easier. lighttpd's mod_status gives very intuitive information about concurrent connections, including each connection's source IP, the URL being accessed, the connection state and the connection time. Just check the high-concurrency IPs in the handle-request state and you can quickly determine the crawler's source IP.

Crawler requests can be rejected either through a kernel firewall or at the web server level, for example with iptables:

iptables -A INPUT -i eth0 -j DROP -p tcp --dport 80 -s xxx.xxx.xxx.0/24

This directly blocks the entire Class C segment (/24) the crawler lives in. Most crawlers run in hosting data centers, where several servers in the same Class C segment may all be running crawlers, and such a segment is unlikely to contain residential broadband users, so blocking the whole segment solves the problem to a large extent.

Some people suggest a rather misguided idea: to punish these crawlers by building dynamic looping links into the site, so that the crawler falls into a trap, an infinite loop it cannot climb out of. In fact there is no need to set a trap: a poorly written crawler cannot crawl its way out of a normal website anyway, and such loops may also cause real search engines to lower your page rankings. Moreover, a crawler stuck in a loop costs its operator almost nothing, while what it burns is your precious server CPU and bandwidth. Simply rejecting the crawler's requests is the most effective anti-crawler strategy.

Second, identify crawlers by their User-Agent information and reject them

Many crawlers do not crawl with very high concurrency and so do not expose themselves easily; some crawlers spread their source IPs over a very wide range, so simply blocking IP addresses will not solve the problem. There are also many small crawlers of various kinds, experimenting with innovative search approaches outside Google. Each one crawls only tens of thousands of pages a day, but dozens of them together can consume millions of dynamic requests daily. Because each small crawler's individual volume is so low, it is hard to dig it out accurately from the huge number of IP addresses that access the site every day.

In this case, we can identify crawlers by their User-Agent information. Every crawler declares its User-Agent when it fetches a page, so by recording and analyzing User-Agent information we can dig out and block crawlers. We need to record the User-Agent of every request; in Rails we can simply add a global before_filter to app/controllers/application.rb to log the User-Agent of each request:

Logger.info "Http_user_agent #{request.env[" Http_user_agent "]}"

Then go through the daily production.log, extract the User-Agent information, and find the most frequent ones. Note that we only care about the User-Agents of crawlers, not of real browsers, so we first have to strip out the browser User-Agents, which takes only one line of shell:

grep HTTP_USER_AGENT production.log | grep -v -E 'MSIE|Firefox|Chrome|Opera|Safari|Gecko' | sort | uniq -c | sort -r -n | head -n 100 > bot.log

The statistical results look something like this:

57335 HTTP_USER_AGENT Baiduspider+(+http://www.baidu.com/search/spider.htm)

56639 HTTP_USER_AGENT Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

42610 HTTP_USER_AGENT Mediapartners-Google

19131 HTTP_USER_AGENT msnbot/2.0b (+http://search.msn.com/msnbot.htm)

From this log you can see at a glance how many requests each crawler makes. Blocking crawlers based on the User-Agent is easy; the lighttpd configuration looks like this:

$HTTP ["useragent"] =~ "qihoobot|^java| Commons-httpclient| wget|^php| ruby| Python "{

Url.rewrite = ("^/(. *)" => "/crawler.html")

}

Blocking crawlers this way is simple but very effective. Besides blocking specific crawlers, it also blocks the default User-Agents of common programming languages and HTTP class libraries, which spares the site harassment from the many throwaway crawlers that programmers write just for practice.

There is another fairly common situation: a search engine's crawler hits the site too often, but the search engine also brings the site a lot of traffic. We do not want to block the crawler outright, only to lower its request frequency and reduce its load on the site. In that case we can do this:

$HTTP["useragent"] =~ "Baiduspider+" {

  connection.delay-seconds = 10

}

Baidu's crawler requests are delayed by 10 seconds before being processed, which effectively reduces the crawler's load on the site.

Third, identify crawlers through the website traffic statistics system and log analysis

Some crawlers like to modify their User-Agent to disguise themselves as real browsers, so the methods above cannot identify them effectively. In that case we can identify them using the real user IPs recorded by the website traffic statistics system.

Mainstream website traffic statistics systems use one of two strategies: one embeds a piece of JS in the page, which sends a request to a dedicated statistics server to record the visit; the other analyzes the server log directly to count the site's traffic. Ideally, the traffic counted by the embedded-JS approach should be higher than that from log analysis, because the user's browser caches pages and not every real visit triggers server processing. In reality, however, the traffic counted from server logs is far higher than that from the embedded JS, in extreme cases more than ten times higher.

Many sites like to use AWStats to analyze server logs and calculate the number of visits, but when they also use Google Analytics to measure traffic, they find that the traffic GA reports is much lower than what AWStats reports. Why is there such a big gap between GA and AWStats? The culprit is web crawlers that disguise themselves as browsers. AWStats cannot identify them, so its figures come out inflated.

In fact, if a website wants to know its real number of visits and accurately understand the visits and visitors of each channel, it should build its own traffic statistics system based on JS embedded in the page. Building one is very simple: write a server-side program to answer the JS request from the client, analyze and identify the request, write a log, and do the aggregate statistics asynchronously in the background.
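
As a rough sketch of such a statistics endpoint (the StatController name and the /stat/visit.js route are assumptions for illustration; visit_ip.log is the file read by the comparison script shown further below), the page would embed something like <script src="/stat/visit.js"></script>, and only a client that actually executes the page's JS fires this request:

# Hypothetical beacon controller; a real production system would buffer
# writes and aggregate asynchronously, as the text above notes.
class StatController < ActionController::Base
  def visit
    # Append the visitor IP to the log consumed by the anti-crawler scripts.
    File.open("#{RAILS_ROOT}/log/visit_ip.log", "a") { |f| f.puts request.remote_ip }
    render :text => "", :content_type => "application/javascript"
  end
end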

The user IPs obtained through such a traffic statistics system are essentially real user visits, because a crawler generally cannot execute the JS code inside a page. So we can compare the IPs recorded by the traffic statistics system with the IPs recorded in the server logs. If an IP issues a large number of requests in the server log but cannot be found in the traffic statistics system, or appears there with only a handful of visits, it is without doubt a web crawler.

One line of shell analyzes the server log and ranks the IP address segments with the most requests:

grep Processing production.log | awk '{print $4}' | awk -F'.' '{print $1"."$2"."$3".0"}' | sort | uniq -c | sort -r -n | head -n 200 > stat_ip.log

Then compare the statistics against the IP addresses recorded by the traffic statistics system, exclude the IPs of real users, and also exclude the crawlers we want to allow, such as the Google, Baidu and Microsoft MSN crawlers. The final result is the list of crawler IP addresses. The following code fragment is a simple illustrative implementation:

whitelist = []
IO.foreach("#{RAILS_ROOT}/lib/whitelist.txt") { |line| whitelist << line.split[0].strip if line }

realiplist = []
IO.foreach("#{RAILS_ROOT}/log/visit_ip.log") { |line| realiplist << line.strip if line }

iplist = []
IO.foreach("#{RAILS_ROOT}/log/stat_ip.log") do |line|
  ip = line.split[1].strip
  iplist << ip if line.split[0].to_i > 3000 && !whitelist.include?(ip) && !realiplist.include?(ip)
end

Report.deliver_crawler(iplist)

This analyzes the server log for IP segments with more than 3,000 requests, excludes the whitelisted addresses and the real-user IPs, and what remains are crawler IPs; it then sends an email to notify the administrator so the crawlers can be dealt with.
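
The Report.deliver_crawler call above implies an ActionMailer class that the original does not show; under Rails 2.x conventions it might look roughly like the sketch below (the addresses are placeholders, and a matching app/views/report/crawler.erb template would render the list):

class Report < ActionMailer::Base
  # Rails 2.x ActionMailer: Report.deliver_crawler(iplist) builds and sends this mail.
  def crawler(iplist)
    recipients "admin@example.com"    # placeholder address
    from       "noreply@example.com"  # placeholder address
    subject    "Suspected crawler IP segments"
    body       :iplist => iplist
  end
end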

Fourth, an implementation strategy for a real-time anti-crawler firewall

Identifying web crawlers by analyzing logs is not a real-time strategy. If a crawler deliberately targets your site, it may adopt a distributed crawling strategy, for example using hundreds or thousands of foreign proxy servers to crawl your site frantically until it becomes inaccessible; by the time you get around to analyzing the logs, it is too late to solve the problem. So we need a real-time anti-crawler strategy that can dynamically identify and block a crawler's access.

Writing such a real-time anti-crawler system is actually quite easy. For example, we can use memcached as an access counter and record the access frequency of every IP. If the frequency exceeds a threshold within a unit of time, we consider that IP likely to be a problem and return a CAPTCHA page, asking the user to fill in a verification code. A crawler of course cannot fill in the code and so gets rejected, which simply and effectively solves the crawler problem.

Using memcached to keep a per-IP access count, and asking the user to fill in a verification code when the count exceeds the threshold within a unit of time, the example code in Rails looks like this:

ip_counter = Rails.cache.increment(request.remote_ip)
if !ip_counter
  Rails.cache.write(request.remote_ip, 1, :expires_in => 30.minutes)
elsif ip_counter > 2000
  render :template => 'test', :status => 401 and return false
end

This program is only the simplest example; a real implementation adds many more checks. For example, we might want to exclude whitelisted IP segments, let particular users through, use different thresholds for logged-in and anonymous users, apply different thresholds and counting rates for particular referer addresses, and so on. A sketch of how the threshold might be chosen follows.
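
One way to factor out that decision is a small helper that picks the threshold per request; this is only a sketch under assumed helper names (whitelisted_ip? and logged_in? are not from the original, and the numbers are illustrative):

# Hypothetical helper: pick a request threshold based on who is asking.
def request_threshold
  return nil  if whitelisted_ip?(request.remote_ip)   # whitelisted segments are never challenged
  return 5000 if logged_in?                           # logged-in users get more headroom
  return 500  if request.referer.to_s =~ /suspect-referer\.example/  # stricter for a suspect referer
  2000                                                # default threshold for anonymous traffic
end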

In addition, if a distributed crawler crawls very frequently, letting it back in once its counter expires would still put a lot of pressure on the server. So we can add another policy: for an IP address that has been asked to fill in a verification code, if it keeps issuing requests within a short period, judge it to be a crawler, add it to a blacklist, and reject all of its subsequent requests. To do that, the sample code can be improved:

before_filter :ip_firewall, :except => :test

def ip_firewall
  render :file => "#{RAILS_ROOT}/public/403.html", :status => 403 if Blacklist.include?(ip_sec)
end

We can define a global filter that screens all requests and rejects any IP address that appears on the blacklist. IP addresses not on the blacklist are then counted and checked:

ip_counter = Rails.cache.increment(request.remote_ip)
if !ip_counter
  Rails.cache.write(request.remote_ip, 1, :expires_in => 30.minutes)
elsif ip_counter > 2000
  crawler_counter = Rails.cache.increment("crawler/#{request.remote_ip}")
  if !crawler_counter
    Rails.cache.write("crawler/#{request.remote_ip}", 1, :expires_in => 10.minutes)
  elsif crawler_counter > 50
    Blacklist.add(ip_sec)
    render :file => "#{RAILS_ROOT}/public/403.html", :status => 403 and return false
  end
  render :template => 'test', :status => 401 and return false
end

When an IP address exceeds the access threshold within a unit of time, we add a second counter to track whether it fills in the verification code promptly. If it does not fill in the code and keeps accessing the site at high frequency within a short period, its IP segment is added to the blacklist, and unless the user fills in the verification code to reactivate it, all further requests are rejected. This way we maintain a blacklist inside the program and dynamically track the crawler situation; we could even write an admin backend to manage the blacklist manually and get a picture of the site's crawlers.

This strategy is already fairly intelligent, but it is not good enough; we can improve it further:

1. Use the website traffic statistics system to improve the real-time anti-crawler system

Remember: the IP addresses recorded by the website traffic statistics system are real user IPs. So inside the traffic statistics system we also operate on memcached, but this time we decrement the count rather than increment it. Each time the traffic statistics system receives a request from an IP, it calls the corresponding cache.decrement(key). So for a real user's IP, the count is always incremented by 1 and then decremented by 1, and never gets very high. This lets us greatly lower the crawler-detection threshold and identify and reject crawlers faster and more accurately.
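
Building on the hypothetical beacon sketched in section three, the decrement would live in that statistics action; a minimal illustration, again with assumed names:

def visit
  # A real browser executed the page's JS, so give this IP its credit back.
  Rails.cache.decrement(request.remote_ip)
  File.open("#{RAILS_ROOT}/log/visit_ip.log", "a") { |f| f.puts request.remote_ip }
  render :text => "", :content_type => "application/javascript"
end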

2. Use a time window to improve the real-time anti-crawler system

A crawler fetches pages at a relatively fixed frequency, unlike a person browsing pages, whose intervals between requests are irregular. So we can create a time window for each IP address and record its last 12 access times, sliding the window with each new request. We compare the most recent access time with the current time: if the interval is long, the IP is judged not to be a crawler and the window is cleared; if the interval is short, we go back and compute the access frequency over the recorded period, and if it exceeds the threshold, we send the user to the verification-code page and ask them to fill in the code.
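
A hedged sketch of such a sliding window, keeping the last 12 timestamps per IP in the cache; the 60-second and 600-second values are illustrative thresholds, not from the original:

WINDOW_SIZE    = 12
MIN_WINDOW_SEC = 60   # 12 requests inside 60 seconds looks automated

def crawler_like?(ip)
  key   = "window/#{ip}"
  times = Rails.cache.read(key) || []
  times = [] if times.any? && Time.now - times.last > 600  # a long idle gap clears the window
  times << Time.now
  times.shift while times.size > WINDOW_SIZE
  Rails.cache.write(key, times, :expires_in => 30.minutes)
  # Crawler-like if the window is full and was filled in too short a span.
  times.size == WINDOW_SIZE && (times.last - times.first) < MIN_WINDOW_SEC
end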

With this, the real-time anti-crawler system is fairly complete; it can quickly identify and automatically block crawler access to protect normal access to the site. But some crawlers may be quite cunning: they may probe your access threshold through repeated tests and then crawl your pages at a speed just below it. So we still need the third approach as a complement, using the logs for after-the-fact analysis and identification; however slowly a crawler crawls, its cumulative daily volume will still exceed your threshold and be caught by your log analysis program.

In short, by combining the four anti-crawler strategies above, we can greatly mitigate the negative impact crawlers have on a website and ensure normal access to the site.
