With the popularity of search engines, web crawlers have become a very common network technology. Beyond Google, Yahoo, Microsoft, and Baidu, dozens of well-known portals run their own search engines, and there are hundreds of thousands of lesser-known ones. For a content-driven website, being visited by web crawlers is unavoidable.
Some well-behaved search-engine crawlers keep a reasonable crawl rate and consume few website resources, but many crawlers are poorly written and may issue hundreds of concurrent requests that fetch the same pages over and over. Crawlers of this kind can be devastating to small and medium-sized websites; those written by programmers with no crawler experience are especially destructive, putting so much pressure on the site that it becomes slow or even unreachable.
Manually identify and reject crawler access
A considerable number of crawlers put a very high load on the website, which makes their source IP addresses easy to spot. The simplest way is to inspect connections to port 80 with netstat:
netstat -nt | grep youhostip:80 | awk '{print $5}' | awk -F":" '{print $1}'| sort | uniq -c | sort -r -n
This shell one-liner sorts source IP addresses by their number of connections to port 80, making crawlers easy to spot at a glance, since a crawler's concurrent connection count is usually very high.
It is even easier if you use Lighttpd as the web server. Lighttpd's mod_status gives intuitive information about concurrent connections, including each connection's source IP address, requested URL, connection state, and duration. Checking the high-concurrency IP addresses in the handle-request state quickly reveals the crawler's source IP address.
Crawler requests can be rejected either at the kernel firewall or at the web server, for example with iptables:
iptables -A INPUT -i eth0 -j DROP -p tcp --dport 80 -s 84.80.46.0/24
This blocks the crawler's entire Class C network. Crawlers usually run in hosted data centers, and the same crawler may be spread across several servers in the same Class C segment; since such a segment is unlikely to be consumer broadband, blocking it solves the problem to a large extent.
Identify and reject crawlers by their User-Agent information
Many crawlers do not crawl with a high number of concurrent connections and are not so easy to expose. Some spread their requests across a wide range of source IP addresses, so simply blocking IP segments does not solve the problem. There are also many small crawlers experimenting with search approaches beyond Google's; each of them fetches only tens of thousands of pages a day, but dozens of them together can consume millions of dynamic requests daily. Because each individual crawler's volume is low, it is hard to pick them out of the huge number of IP addresses that visit the site every day.
In this case, we can identify crawlers by their User-Agent. Every crawler declares its User-Agent when it fetches a page, so we can record and analyze this information to discover and block crawlers. We need to log the User-Agent of every request; in Rails, we can simply add a global before_filter in app/controllers/application.rb to record it:
logger.info "HTTP_USER_AGENT #{request.env["HTTP_USER_AGENT"]}"
Then collect each day's production.log and extract the User-Agent information to find the most frequent User-Agents. Note that we only care about the User-Agents of crawlers, not those of real browsers, so browser User-Agents must be excluded. A single line of shell is enough:
grep HTTP_USER_AGENT production.log | grep -v -E 'MSIE|Firefox|Chrome|Opera|Safari|Gecko' | sort | uniq -c | sort -r -n | head -n 100 > bot.log
The statistical results are similar to the following:
57335 HTTP_USER_AGENT Baiduspider+(+http://www.baidu.com/search/spider.htm)
56639 HTTP_USER_AGENT Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
42610 HTTP_USER_AGENT Mediapartners-Google
19131 HTTP_USER_AGENT msnbot/2.0b (+http://search.msn.com/msnbot.htm)
The log shows the number of requests made by each crawler. Blocking crawlers based on their User-Agent is then easy; the Lighttpd configuration looks like this:
$HTTP["useragent"] =~ "qihoobot|^Java|Commons-HttpClient|Wget|^PHP|Ruby|Python" { url.rewrite = ( "^/(.*)" => "/crawler.html" )}
This method is simple but effective. Besides blocking specific crawlers, you can also block the default User-Agents of common programming languages and HTTP client libraries, which spares the website from crawlers thrown together by inexperienced programmers.
Another common situation is that a search engine's crawler hits the website too frequently, yet the search engine brings the site a lot of traffic and we do not want to block it outright; we only want to lower its request rate and reduce its load on the site. We can do this:
$HTTP["user-agent"] =~ "Baiduspider+" { connection.delay-seconds = 10}
The crawler requests of Baidu are processed after a delay of 10 seconds, which can effectively reduce the crawler's load on the website.
Identify crawlers through the website traffic statistics system and log analysis
Some crawlers like to disguise themselves by modifying their User-Agent to look like that of a real browser, so they cannot be identified effectively this way. In that case we can use the real user access IP addresses recorded by the website traffic statistics system to identify them.
Mainstream website traffic statistics systems use one of two implementation strategies: one embeds a piece of JS in the page, which records visits by sending a request to a dedicated statistics server; the other analyzes the server logs directly to count traffic. In theory the JS-embedded count should be higher than the log-based count, because a real user's browser may serve a page from its cache without hitting the server at all. In practice, however, the traffic obtained by analyzing server logs is far higher than the traffic obtained from embedded JavaScript, in extreme cases more than ten times higher.
Many websites like to use AWStats to analyze server logs and compute traffic, but as soon as they also use Google Analytics, they find that the traffic reported by GA is much lower than the AWStats figure. Why is the gap between GA and AWStats so large? The culprit is web crawlers that disguise themselves as browsers: AWStats cannot identify them, so its statistics are inflated.
If a website wants to know its actual traffic, including the traffic and users of each channel, it should use JS embedded in the page and build its own traffic statistics system. Writing one yourself is not hard: write a server-side program that responds to the client-side JS request, analyzes and identifies the request, writes a log, and does the aggregation asynchronously in the background.
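As an illustration only, a tracking endpoint of this kind in Rails could be as small as the sketch below; the controller, action, and parameter names are hypothetical:

# Hypothetical sketch of a JS-tracking endpoint (controller and parameter names are
# illustrative). A real system would write to a dedicated file such as log/visit_ip.log,
# which the comparison script later in this article reads, and aggregate asynchronously.
class VisitController < ApplicationController
  def track
    logger.info "VISIT #{request.remote_ip} #{params[:page]}"
    # Return an empty response so the embedded <script src=...> finishes quickly.
    render :nothing => true
  end
end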
The IP addresses captured by the traffic statistics system are essentially all real user visits, because under normal circumstances a crawler cannot execute the JS snippet in the page. So we can compare the IP addresses recorded by the traffic statistics system with those recorded in the server logs: if an IP address in the server log makes a large number of requests but does not appear in the traffic statistics system, or appears with only a handful of visits, it is almost certainly a web crawler.
The following one-line shell command analyzes the server log and extracts the most frequently accessed Class C IP segments:
grep Processing production.log | awk '{print $4}' | awk -F'.' '{print $1"."$2"."$3".0"}' | sort | uniq -c | sort -r -n | head -n 200 > stat_ip.log
Then compare the result with the IP addresses recorded by the traffic statistics system, exclude the real user IP addresses, and also exclude the crawlers we want to allow, such as those of Google, Baidu, and Microsoft. What remains is the crawlers' IP addresses. The following code snippet is a simple implementation example:
whitelist = []
IO.foreach("#{RAILS_ROOT}/lib/whitelist.txt") { |line| whitelist << line.split[0].strip if line }

realiplist = []
IO.foreach("#{RAILS_ROOT}/log/visit_ip.log") { |line| realiplist << line.strip if line }

iplist = []
IO.foreach("#{RAILS_ROOT}/log/stat_ip.log") do |line|
  ip = line.split[1].strip
  iplist << ip if line.split[0].to_i > 3000 && !whitelist.include?(ip) && !realiplist.include?(ip)
end

Report.deliver_crawler(iplist)
It analyzes the IP segments with more than 3,000 requests in the server log, excludes the whitelisted segments and the real visitor IP addresses, and what remains are crawler IP addresses; it then emails the list to the administrator for further handling.
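Report.deliver_crawler above is an ActionMailer call; the article does not show the mailer itself, but a minimal Rails 2-style sketch could look like this, with the class body and the admin address being assumptions:

# Hypothetical mailer behind Report.deliver_crawler.
class Report < ActionMailer::Base
  def crawler(iplist)
    recipients "admin@example.com"          # assumed administrator address
    from       "noreply@example.com"
    subject    "Suspected crawler IP segments"
    body       :iplist => iplist            # rendered by app/views/report/crawler.erb
  end
end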
Real-time anti-crawler firewall strategies for websites
Identifying crawlers by analyzing logs is not a real-time anti-crawler strategy. If a crawler is determined to target your website, it may adopt a distributed crawling strategy, for example using hundreds of thousands of foreign proxy servers to crawl the site until it becomes unreachable; log analysis after the fact cannot solve the problem in time. So we need a real-time anti-crawler strategy that dynamically identifies and blocks crawler access as it happens.
Writing such a real-time anti-crawler system is actually quite simple. For example, we can use memcached as an access counter to record the access frequency of each IP address. If the frequency exceeds a threshold within a unit of time, we consider that IP suspicious and return a CAPTCHA page asking the user to fill in a verification code. A crawler naturally cannot enter the code and is therefore rejected, which easily solves the crawler problem.
Using memcached to record the access count of each IP address, and showing a CAPTCHA once the per-unit-time threshold is exceeded, looks like this in Rails:
ip_counter = Rails.cache.increment(request.remote_ip)
if !ip_counter
  Rails.cache.write(request.remote_ip, 1, :expires_in => 30.minutes)
elsif ip_counter > 2000
  render :template => 'test', :status => 401 and return false
end
This program is only the simplest example; a real implementation adds many more checks. For example, we may want to exclude whitelisted IP segments, let specific User-Agents pass through, and apply different thresholds and counter acceleration for logged-in versus anonymous users and for requests with or without a Referer.
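A rough sketch of how those extra checks might be layered onto the counter; the whitelist segments, the thresholds, and the logged_in? helper are assumptions for illustration:

# Illustrative refinement of the counter above; names and thresholds are assumed.
WHITELIST_SEGMENTS = ["60.28.22", "123.125.71"]   # e.g. trusted crawler segments
ALLOWED_AGENTS     = /Googlebot|Baiduspider/

def crawler_check
  ip_sec = request.remote_ip.split('.')[0, 3].join('.')
  return if WHITELIST_SEGMENTS.include?(ip_sec)                 # whitelisted segment
  return if request.env["HTTP_USER_AGENT"] =~ ALLOWED_AGENTS    # allowed User-Agent

  # Logged-in users and requests carrying a Referer get a more generous threshold.
  limit = logged_in? ? 5000 : (request.referer ? 2000 : 500)

  ip_counter = Rails.cache.increment(request.remote_ip)
  if !ip_counter
    Rails.cache.write(request.remote_ip, 1, :expires_in => 30.minutes)
  elsif ip_counter > limit
    render :template => 'test', :status => 401 and return false
  end
end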
In addition, if a distributed crawler crawls too fast, letting it back in every time the counter expires still puts a lot of pressure on the server. So we can add one more policy: for an IP address that has been asked to fill in a CAPTCHA, if it keeps sending requests within a short period of time, treat it as a crawler, add it to a blacklist, and reject all of its subsequent requests. The sample code can be improved as follows:
before_filter :ip_firewall, :except => :test

def ip_firewall
  render :file => "#{RAILS_ROOT}/public/403.html", :status => 403 if BlackList.include?(ip_sec)
end
We define a global before_filter that screens every request and rejects any IP address that appears in the blacklist. Requests from IP addresses that are not blacklisted are then counted:
ip_counter = Rails.cache.increment(request.remote_ip)
if !ip_counter
  Rails.cache.write(request.remote_ip, 1, :expires_in => 30.minutes)
elsif ip_counter > 2000
  crawler_counter = Rails.cache.increment("crawler/#{request.remote_ip}")
  if !crawler_counter
    Rails.cache.write("crawler/#{request.remote_ip}", 1, :expires_in => 10.minutes)
  elsif crawler_counter > 50
    BlackList.add(ip_sec)
    render :file => "#{RAILS_ROOT}/public/403.html", :status => 403 and return false
  end
  render :template => 'test', :status => 401 and return false
end
Once an IP address exceeds the frequency threshold within the unit time, a second counter tracks how fast it keeps requesting after being sent to the CAPTCHA page; if it keeps hammering the site without entering the code, its IP segment is added to the blacklist and all of its requests are rejected until a user enters the CAPTCHA to reactivate it. By maintaining the blacklist in the program we can track crawlers dynamically, and we can even build a small admin backend to manage the blacklist by hand and see which crawlers visit the site.
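The BlackList.include? and BlackList.add calls are not defined in the article; one possible minimal sketch, keeping blacklisted Class C segments in the same memcached store as the counters, is:

# Assumed implementation of the BlackList used above; the article does not show one.
class BlackList
  KEY = "blacklist/ip_segments"

  def self.include?(ip_sec)
    (Rails.cache.read(KEY) || []).include?(ip_sec)
  end

  def self.add(ip_sec)
    segments = Rails.cache.read(KEY) || []
    Rails.cache.write(KEY, (segments << ip_sec).uniq)
  end
end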
For this general anti-crawler functionality, we developed an open-source plugin: https://github.com/csdn-dev/limiter
This policy is already intelligent, but not good enough! We can continue to improve:
1. Use the website traffic statistics system to improve the real-time anti-crawler system
Remember that the IP addresses recorded by the website traffic statistics system are real user visits? We can operate on the same memcached counters from within the traffic statistics system, but this time decrement instead of increment: every time the statistics system receives a request from an IP, it calls cache.decrement(key) for that IP. For a real user's IP the counter is therefore always incremented and then decremented, so it never grows very high. This lets us lower the crawler detection threshold considerably and identify and reject crawlers faster and more accurately.
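Continuing the hypothetical tracking endpoint sketched earlier, the decrement might be wired in like this:

# Sketch: inside the JS-tracking action, cancel out the increment done by the
# request counter, so real users (who execute the JS) hover around zero.
class VisitController < ApplicationController
  def track
    logger.info "VISIT #{request.remote_ip} #{params[:page]}"
    Rails.cache.decrement(request.remote_ip)   # offsets Rails.cache.increment above
    render :nothing => true
  end
end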
2. Improve the real-time anti-crawler system using time windows
Crawlers fetch pages at a fixed frequency, unlike humans, whose intervals between page views are fairly irregular. We can therefore keep a time window for each IP address that records its last 12 access times, sliding the window with each new record. If the earliest recorded time is far from the current time, the IP is not a crawler and the window is cleared; if it is close, we compute the access frequency over the window, and if it exceeds the threshold we redirect to the CAPTCHA page and ask the user to enter the code.
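A sketch of such a sliding window, with assumed window size and thresholds:

# Illustrative sliding-window check; window size and time span are assumptions.
WINDOW_SIZE = 12    # keep the last 12 access times per IP
MAX_SPAN    = 60    # seconds: if the 12 hits span more than this, treat as human

def sliding_window_check(ip)
  key   = "window/#{ip}"
  times = Rails.cache.read(key) || []
  times << Time.now.to_i
  times = times.last(WINDOW_SIZE)               # slide the window
  Rails.cache.write(key, times, :expires_in => 30.minutes)

  return false if times.size < WINDOW_SIZE      # not enough data yet
  if times.last - times.first > MAX_SPAN
    Rails.cache.delete(key)                     # irregular or slow access: clear window
    false
  else
    true                                        # 12 hits within MAX_SPAN: likely a crawler
  end
end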
At this point the real-time anti-crawler system is fairly complete: it can quickly identify and automatically block crawlers to protect normal access to the website. Some crawlers are quite cunning, though; they may run repeated test crawls to probe your threshold and then crawl just below it. In that case we fall back on the third method, analyzing the logs afterwards: even if a crawler crawls slowly, its daily volume will exceed your threshold and be caught by the log analysis program.
In short, by combining the four anti-crawler strategies above, we can greatly mitigate the negative impact of crawlers on the website and ensure normal access to it.