Because of the popularity of search engines, web crawlers have become a very common network technology. Besides the search giants Google, Yahoo, Microsoft, and Baidu, almost every large portal runs its own search engine; there are dozens of well-known crawlers and thousands of obscure ones. For a content-driven website, being visited by web crawlers is unavoidable.
Some well-behaved search engine crawlers crawl at a reasonable frequency and consume few site resources, but many badly written crawlers crawl very poorly, often firing tens or hundreds of concurrent requests and fetching the same pages over and over. This kind of crawler can be devastating for small and medium-sized websites. In particular, crawlers written by programmers with little crawling experience can be extremely aggressive, putting so much pressure on a site that it becomes slow or even inaccessible.
Manual identification and denial of crawler access
Quite a few crawlers put a very high load on the site, which makes their source IPs easy to identify. The simplest way is to check the connections on port 80 with netstat:
netstat -nt | grep yourhostip:80 | awk '{print $5}' | awk -F":" '{print $1}' | sort | uniq -c | sort -r -n
This shell one-liner sorts source IPs by their number of port-80 connections, so a web crawler can be spotted at a glance: a crawler's concurrent connection count is usually very high.
If you use lighttpd as the web server, it is even easier. lighttpd's mod_status gives very intuitive information about concurrent connections, including each connection's source IP, requested URL, connection state, and connection time. Just look at the high-concurrency IPs in the handle-request state to quickly identify the crawler's source IP.
A crawler's requests can be rejected either in the kernel firewall or in the web server, for example with iptables:
iptables -A INPUT -i eth0 -j DROP -p tcp --dport 80 -s 84.80.46.0/24
This blocks the entire C segment (a /24 block of 256 addresses) that the crawler lives in. The reason is that crawlers generally run in hosting data centers and may be spread across several servers within the same C segment, while a data-center C segment is very unlikely to contain home broadband users, so blocking the whole segment largely solves the problem.
Rejecting crawlers by identifying their User-Agent information
Many crawlers do not crawl with very high concurrency and generally do not expose themselves easily. Some crawlers have source IPs spread so widely that simply blocking IP addresses cannot solve the problem. There are also many small crawlers experimenting with search methods outside of Google; each of them fetches tens of thousands of pages a day, and a few dozen of them together consume millions of dynamic requests every day. Because each individual crawler's volume is so low, it is hard to pick them out of the huge number of IP addresses that visit the site each day.
In this case we can identify them by their User-Agent information. Every crawler declares its own User-Agent when it fetches a page, so we can mine and block crawlers by recording and analyzing User-Agent information. We need to record the User-Agent of every request; in Rails we can simply add a global before_filter in app/controllers/application.rb to record the User-Agent of each request:
logger.info "HTTP_USER_AGENT #{request.env["HTTP_USER_AGENT"]}"
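For completeness, a minimal sketch of such a filter, assuming a classic Rails 2-style ApplicationController (the filter name log_user_agent is just an illustration):

class ApplicationController < ActionController::Base
  before_filter :log_user_agent

  private

  # write the User-Agent of every request to the production log for later analysis
  def log_user_agent
    logger.info "HTTP_USER_AGENT #{request.env["HTTP_USER_AGENT"]}"
  end
end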
Then process each day's production.log, extract the User-Agent information, and find the most frequent User-Agents. Note that we only care about the User-Agents of crawlers, not of real browsers, so browser User-Agents should be filtered out. All of this takes just one line of shell:
grep HTTP_USER_AGENT production.log | grep -v -E 'MSIE|Firefox|Chrome|Opera|Safari|Gecko' | sort | uniq -c | sort -r -n | head -n 100 > bot.log
The statistics look something like this:
57335 HTTP_USER_AGENT Baiduspider+(+http://www.baidu.com/search/spider.htm)
56639 HTTP_USER_AGENT Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
42610 HTTP_USER_AGENT Mediapartners-Google
19131 HTTP_USER_AGENT msnbot/2.0b (+http://search.msn.com/msnbot.htm)
From this log the number of requests made by each crawler is obvious. Blocking crawlers based on their User-Agent is easy; the lighttpd configuration looks like this:
$HTTP["useragent"] =~ "qihoobot|^Java|Commons-HttpClient|Wget|^PHP|Ruby|Python" { url.rewrite = ( "^/(.*)" => "/crawler.html" )}
Blocking crawlers this way is simple but very effective. Besides blocking specific crawlers, you can also block the default User-Agents of common programming languages and HTTP client libraries, which spares the site a lot of needless harassment from crawlers that programmers write just to practice.
Another common situation is that a search engine's crawler hits the site too frequently, but the search engine also brings the site a lot of traffic. We do not want to block it outright, only to lower its request rate and reduce the load it puts on the site. In that case we can do this:
$HTTP["user-agent"] =~ "Baiduspider+" { connection.delay-seconds = 10}
This delays Baidu's crawler requests by 10 seconds before handling them, which effectively reduces the load the crawler puts on the site.
Identifying crawlers through the website traffic statistics system and log analysis
Some crawlers like to modify their User-Agent to disguise themselves as a real browser, which makes them impossible to identify this way. In that case we can identify them using the real visitor IPs recorded by the website's traffic statistics system.
Mainstream website traffic statistics systems use one of two implementation strategies: one embeds a piece of JS in the web page that sends a request to a dedicated statistics server to record each visit; the other analyzes the server logs directly to count visits. Ideally, the embedded-JS numbers should be higher than the log-analysis numbers, because the user's browser caches pages: a cached visit never reaches the server log, yet the embedded JS still fires. In reality, however, log analysis reports more traffic than the embedded-JS approach, in extreme cases more than ten times as much.
Many websites now like to use AWStats to analyze server logs and count visits, yet when they add Google Analytics they find that GA reports far less traffic than AWStats. Why is the gap so big? The culprit is web crawlers that disguise themselves as browsers. AWStats cannot identify them, so its numbers are inflated.
In fact, if a website wants to know its real visit numbers, and wants to know precisely how each channel performs and who its visitors are, it should build its own traffic statistics system based on JS embedded in the page. Building one is quite simple: write a server program that responds to the JS beacon requests from clients, analyzes and identifies each request, writes a log, and does the statistics asynchronously in the background.
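As a rough illustration, the beacon handler could be as small as this (StatController and the VISIT_IP log format are hypothetical names; the resulting log is what would feed the visit_ip.log file used further below):

class StatController < ActionController::Base
  # called by the JS snippet embedded in every page
  def visit
    logger.info "VISIT_IP #{request.remote_ip}"   # record the real visitor's IP
    render :nothing => true
  end
end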
The user IPs obtained through the traffic statistics system are basically real visitors, because a crawler cannot normally execute the JS embedded in the page. So we can compare the IPs recorded by the traffic statistics system with the IPs logged by the server: if an IP makes a large number of requests in the server log but does not appear in the traffic statistics system at all, or appears with only a handful of visits, it is without doubt a web crawler.
One line of shell is enough to extract the IP address segments with the most requests from the server log:
grep Processing production.log | awk '{print $4}' | awk -F'.' '{print $1"."$2"."$3".0"}' | sort | uniq -c | sort -r -n | head -n 200 > stat_ip.log
Then compare this result with the IPs recorded by the traffic statistics system, exclude the real users' IPs, and also exclude the crawlers we want to let through, such as the Google, Baidu, and Microsoft MSN crawlers. What remains is a list of crawler IP addresses. The following code snippet is a simple implementation:
whitelist = []
IO.foreach("#{RAILS_ROOT}/lib/whitelist.txt") { |line| whitelist << line.split[0].strip if line }

realiplist = []
IO.foreach("#{RAILS_ROOT}/log/visit_ip.log") { |line| realiplist << line.strip if line }

iplist = []
IO.foreach("#{RAILS_ROOT}/log/stat_ip.log") do |line|
  ip = line.split[1].strip
  iplist << ip if line.split[0].to_i > 3000 && !whitelist.include?(ip) && !realiplist.include?(ip)
end

Report.deliver_crawler(iplist)
It takes the IP address segments that made more than 3,000 requests in the server log, excludes the whitelisted segments and the real visitor IPs, ends up with the crawler IPs, and then sends an email notifying the administrator to deal with them.
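The Report.deliver_crawler call above implies an ActionMailer class; a hypothetical Rails 2-style version might look like this (the recipient address is a placeholder):

class Report < ActionMailer::Base
  def crawler(iplist)
    recipients "admin@example.com"           # placeholder address
    subject    "Suspected crawler IP segments"
    body       :iplist => iplist
  end
end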
Implementing a real-time anti-crawler firewall on the website
Identifying crawlers by analyzing logs afterwards is not a real-time strategy. If a crawler is determined to target your site, it may adopt a distributed crawling strategy, for example using hundreds or thousands of proxy servers abroad to crawl your site so frantically that it becomes unreachable; in that case after-the-fact log analysis cannot solve the problem in time. So we need a real-time anti-crawler strategy that dynamically identifies and blocks crawler access as it happens.
Writing such a real-time anti-crawler system yourself is actually quite easy. For example, we can use memcached as an access counter and record the access frequency of every IP. If an IP exceeds a threshold within a unit of time, we consider it suspicious and return a CAPTCHA page, asking the user to fill in the verification code. A crawler of course cannot fill it in and gets rejected, which solves the crawler problem quite simply.
Using memcached to record the access count of each IP and showing the CAPTCHA once the per-time-unit threshold is exceeded, the example code in Rails looks like this:
ip_counter = Rails.cache.increment(request.remote_ip)
if !ip_counter
  Rails.cache.write(request.remote_ip, 1, :expires_in => 30.minutes)
elsif ip_counter > 2000
  render :template => 'test', :status => 401 and return false
end
This program is only the simplest example; a real implementation adds many more checks. For example, we may want to exclude whitelisted IP address segments, let specific User-Agents through, and use different thresholds and counting accelerators for logged-in versus anonymous users and for requests with or without a referer, and so on.
In addition, if a distributed crawler crawls too frequently, letting it come back once the counter expires still puts a lot of pressure on the server, so we can add another policy: for an IP address that has been shown the CAPTCHA, if it keeps issuing requests over a short period without answering it, treat it as a crawler, add it to a blacklist, and reject all of its subsequent requests. For this, the example code can be improved as follows:
before_filter :ip_firewall, :except => :test

def ip_firewall
  render :file => "#{RAILS_ROOT}/public/403.html", :status => 403 if BlackList.include?(ip_sec)
end
We define a global filter that screens every request and rejects any IP address segment that appears on the blacklist. IPs that are not blacklisted are then counted as before:
ip_counter = Rails.cache.increment(request.remote_ip)
if !ip_counter
  Rails.cache.write(request.remote_ip, 1, :expires_in => 30.minutes)
elsif ip_counter > 2000
  crawler_counter = Rails.cache.increment("crawler/#{request.remote_ip}")
  if !crawler_counter
    Rails.cache.write("crawler/#{request.remote_ip}", 1, :expires_in => 10.minutes)
  elsif crawler_counter > 50
    BlackList.add(ip_sec)
    render :file => "#{RAILS_ROOT}/public/403.html", :status => 403 and return false
  end
  render :template => 'test', :status => 401 and return false
end
If an IP exceeds the access-frequency threshold within the time unit, a second counter starts tracking whether it fills in the CAPTCHA right away. If it does not, and keeps accessing at a high frequency over a short time, its IP address segment is blacklisted, and all of its requests are rejected unless the user reactivates it by filling in the CAPTCHA. By maintaining a blacklist in the program this way we can dynamically track the crawler situation, and we can even write an admin backend to manage the blacklist manually and keep an eye on the site's crawlers.
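The BlackList used above is assumed rather than shown; one possible minimal version, backed by the same cache, could look like this (a real one would likely be persisted and editable from an admin backend):

class BlackList
  KEY = "crawler/blacklist"

  def self.include?(ip_sec)
    (Rails.cache.read(KEY) || []).include?(ip_sec)
  end

  def self.add(ip_sec)
    list = Rails.cache.read(KEY) || []
    Rails.cache.write(KEY, (list << ip_sec).uniq)
  end
end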
We have packaged this generic anti-crawler functionality as an open source plugin: https://github.com/csdn-dev/limiter
This strategy is already fairly intelligent, but it is still not good enough! We can continue to improve it:
1. Use the website traffic statistics system to improve the real-time anti-crawler system
Remember that the IP addresses recorded by the traffic statistics system are real user visits? So in the traffic statistics system we also operate on memcached, but this time we decrement the counter instead of incrementing it. Every time the statistics system receives a request from an IP, it calls cache.decrement(key) for that IP. For a real user's IP the count is therefore always incremented by 1 and then decremented by 1, and never grows large. This lets us lower the crawler-detection threshold considerably and identify and reject crawlers faster and more accurately.
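Concretely, the hypothetical visit action sketched earlier could be extended along these lines (an illustrative sketch, not the plugin's actual code):

def visit
  logger.info "VISIT_IP #{request.remote_ip}"
  # a real browser executed the embedded JS, so cancel out the firewall's increment
  Rails.cache.decrement(request.remote_ip)
  render :nothing => true
end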
2. Use a time window to improve the real-time anti-crawler system
Crawlers fetch pages at a relatively fixed frequency, unlike humans, whose intervals between page views are quite irregular. So we can keep a time window for each IP address that records the times of its last 12 visits, sliding the window each time a new visit arrives. We compare the most recent visit time with the current time: if the interval is long, the IP is not behaving like a crawler, so clear the window; if it is short, compute the access frequency over the whole window, and if that frequency exceeds the threshold, redirect to the CAPTCHA page and ask the user to fill in the verification code.
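A rough Ruby sketch of this idea (the window size, the "long interval" cutoff, and the frequency threshold below are illustrative values, not taken from the article):

WINDOW_SIZE = 12

def suspicious_frequency?(ip)
  key    = "window/#{ip}"
  window = Rails.cache.read(key) || []
  now    = Time.now.to_i

  window = [] if window.any? && now - window.last > 600   # long gap: start a fresh window
  window << now
  window.shift while window.size > WINDOW_SIZE
  Rails.cache.write(key, window, :expires_in => 1.hour)

  # 12 requests within one minute is far faster than a human browsing pages
  window.size == WINDOW_SIZE && (window.last - window.first) < 60
end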
At this point the real-time anti-crawler system is fairly complete; it can quickly identify and automatically block crawlers to protect normal access to the site. But some crawlers are quite cunning: they may probe your access thresholds through repeated tests and then crawl just below them, so we still need the third method as a backup, analyzing the logs after the fact. Even if a crawler crawls slowly, its accumulated daily crawl volume will eventually exceed your threshold and be picked up by the log analysis program.
In short, by combining the four anti-crawler strategies above we can greatly reduce the impact of crawlers on a website and ensure normal access to the site.