How to add an elegant anti-crawler strategy to your website

Source: Internet
Author: User

Your website's content is valuable. You want legitimate search engine crawlers such as Google and Baidu to index it, but you do not want unscrupulous copycat crawlers to take your data for free. This article explores how to add an elegant anti-crawler strategy to your website.

Ideas

The anti-crawler strategy takes the following points into account:

    1. Legitimate search engine crawlers such as Google and Baidu can crawl the site, with no limit on traffic or concurrency;

    2. Copycat crawlers are blocked;

    3. Detection happens in real time, rather than by analyzing access statistics after a period of time;

    4. False positives are handled in a user-friendly way (this is where the elegance lies).


Most crawlers do not access pages through a browser. They download only the HTML source of a page and do not load the JS, CSS, or images it references, which is the key to telling a crawler from a real visitor. A request that is identified as not coming from a browser must be a crawler. To satisfy points 1 and 2 above, we then check the User-Agent HTTP header to see whether the request identifies itself as a Google or Baidu spider. A stricter check also verifies that the source IP belongs to Google's or Baidu's crawlers; these IPs can be found on the Internet. If the request is not on the whitelist, access to the content is blocked.

Of course, some crawlers fetch content the way a browser does, so the check must continue even when a source IP is recognized as browser access. We also count the requests from that IP within a time slice. If the count exceeds a threshold, the IP is treated as a crawler and access to the content is blocked.

Because the strategy is based on IP addresses, there will be false positives, especially from the concurrency limit, so we need a friendly way to block a visit. Returning a blank 50x/40x error page is rude. A more elegant approach lets a real user who has been misjudged unlock access manually and carry on. The natural choice is a verification code (CAPTCHA): the user enters the code shown in an image to unlock access. But the verification codes we usually see are still crude. They have evolved from simple digits to Chinese characters, arithmetic problems, and many other variants. Complex codes make life hard for users, and interference characters are added on top in the name of better security. In practice these are ill-conceived engineering products, and users complain about them.

The purpose of a verification code is to tell humans from machines: a machine should not be able to pass it automatically, while a person should find it convenient and pleasant. Here we use a more interesting verification code that asks people to identify objects. The verification system stores a large number of pictures of everyday things such as animals, plants, and furniture. To verify a user, the system randomly selects a few of these pictures and asks the user to pick the one that matches a preset answer in order to unlock access.

Returning to the steps for identifying a crawler, we summarize them in a flowchart:


Implementation

We implement the anti-crawler system with Node.js (Express) and Redis. Redis stores the counters.

1. Determining whether a request comes from a browser

Each time a page is served, the page access count for that IP in Redis is incremented by 1. Every page includes a JS file, and each request for that file decrements the IP's count by 1. If the requests do not come from a browser, the count keeps growing. If they do, downloading the page source adds 1 and loading the JS file subtracts 1, so the count returns to zero. In principle we only need to check whether the count is zero to know whether the visitor is a browser. To leave some room for the special case where a page is downloaded but its JS fails to load, we set a threshold, for example 5: when the count exceeds 5, the IP is judged to be a crawler.
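The counting rule above can be sketched without Redis. This is a minimal illustration and not the article's code: a plain Map stands in for Redis, and the names createDoorCounter, onPage, onScript, and looksLikeCrawler are hypothetical.

```javascript
// In-memory stand-in for the per-IP page counter described above.
// A Map replaces Redis here purely for illustration.
function createDoorCounter(threshold) {
  const counts = new Map();
  return {
    // Called when an HTML page is served: count + 1.
    onPage(ip) {
      const n = (counts.get(ip) || 0) + 1;
      counts.set(ip, n);
      return n;
    },
    // Called when the embedded JS file is requested: count - 1, never below 0.
    onScript(ip) {
      const n = Math.max((counts.get(ip) || 0) - 1, 0);
      counts.set(ip, n);
      return n;
    },
    // A count above the threshold means pages were fetched without the script.
    looksLikeCrawler(ip) {
      return (counts.get(ip) || 0) > threshold;
    },
  };
}

const door = createDoorCounter(5);

// A browser loads the page and then the script, so its count returns to 0.
for (let i = 0; i < 10; i++) {
  door.onPage('10.0.0.1');
  door.onScript('10.0.0.1');
}

// A crawler downloads only the HTML, so its count keeps growing.
for (let i = 0; i < 6; i++) {
  door.onPage('10.0.0.2');
}

console.log(door.looksLikeCrawler('10.0.0.1')); // false
console.log(door.looksLikeCrawler('10.0.0.2')); // true
```

Clamping the count at zero keeps a cached page that re-requests the script from building up "credit" that would later hide crawler-like behavior.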

2. Crawler whitelist recognition

If the previous step identifies the request as crawler access, we then check the User-Agent header and the source IP against a preset whitelist. If they are not on it, access is blocked and the verification code is displayed. This step is simple and needs no further explanation.
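A whitelist check that looks at both the User-Agent and the source IP might look like the sketch below. The function name isTrustedSpider and both lists are assumptions for illustration. The IP addresses are documentation-range placeholders, not real crawler addresses.

```javascript
// Hypothetical whitelist data. Real crawler IP ranges must come from the
// search engines' own published information.
const TRUSTED_AGENTS = [/Googlebot/i, /Baiduspider/i];
const TRUSTED_IPS = new Set(['192.0.2.10', '192.0.2.11']); // placeholders

// A request is trusted only if BOTH the User-Agent and the source IP match.
function isTrustedSpider(userAgent, ip) {
  const agentOk = TRUSTED_AGENTS.some((re) => re.test(userAgent || ''));
  return agentOk && TRUSTED_IPS.has(ip);
}

console.log(isTrustedSpider('Mozilla/5.0 (compatible; Googlebot/2.1)', '192.0.2.10')); // true
console.log(isTrustedSpider('Mozilla/5.0 (compatible; Googlebot/2.1)', '203.0.113.5')); // false
console.log(isTrustedSpider('curl/8.0', '192.0.2.10')); // false
```

The patterns deliberately omit the g flag: a global regular expression keeps state between calls to test(), which can make repeated checks return inconsistent results.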

3. Concurrency limit for browser access

This also uses a per-IP counter in Redis, this time relying on Redis key expiration. Each time the count is incremented, the key is set to expire after a certain period, for example 5 seconds, which acts as a time slice. If another request arrives within 5 seconds, the count increases by 1 and the expiration is extended by another 5 seconds. If no request arrives within 5 seconds, the key disappears, and the next request starts counting from 1. We can set a threshold, for example 20: once an IP's count exceeds it, the IP is over the limit, access is blocked, and the verification code is displayed. Because every request extends the expiration, the count covers an unbroken run of requests in which each one arrives within 5 seconds of the previous one.
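The expiring counter can be illustrated with an in-memory stand-in for a Redis key that is incremented and given a fresh expiration on every request. This is a sketch under assumptions: the names createHairCounter and hit are hypothetical, and the current time is passed in so the behavior can be tried without waiting.

```javascript
// In-memory stand-in for a Redis key with INCR + EXPIRE semantics.
// `now` is injected (in milliseconds) instead of reading the system clock.
function createHairCounter(windowMs, threshold) {
  const entries = new Map(); // ip -> { count, expiresAt }
  return {
    // Record one request and return true if the IP is over the limit.
    hit(ip, now) {
      const entry = entries.get(ip);
      const count = entry && entry.expiresAt > now ? entry.count + 1 : 1;
      // Every request pushes the expiry out by a full window.
      entries.set(ip, { count, expiresAt: now + windowMs });
      return count > threshold;
    },
  };
}

const hair = createHairCounter(5000, 20);
let blocked = false;

// 21 requests, 100 ms apart: the key never expires, so the 21st exceeds 20.
for (let i = 0; i < 21; i++) {
  blocked = hair.hit('10.0.0.3', i * 100);
}
console.log(blocked); // true

// After more than 5 seconds of silence the count restarts from 1.
console.log(hair.hit('10.0.0.3', 2000 + 5001)); // false
```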

4. Elegant verification code

The system is preloaded with many pictures, each showing an animal, a plant, a piece of furniture, and so on, and each with an ID. Three pictures are picked at random, and one of them is chosen as the correct answer. This happens on the server, and the ID of the correct answer is stored in the session. The page shows the 3 pictures and asks the user to choose the one that matches the preset answer. The user simply clicks the corresponding picture, the form is submitted to the server, and the system compares the submitted ID with the ID in the session. Of course, because every picture has a fixed ID, this scheme can be defeated by exhaustive enumeration, and there is plenty of room for improvement. This article only covers the prototype and does not go further.
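The selection and comparison steps can be sketched as follows. The image catalogue, the function names createChallenge and checkAnswer, and the plain object standing in for the session are all assumptions for illustration.

```javascript
// Hypothetical image catalogue: the index is the image ID, the value its label.
const IMAGES = ['cat', 'dog', 'tree', 'chair', 'apple', 'car'];

// Pick `count` distinct image IDs and choose one of them as the answer.
// `random` is injectable so the function can be exercised deterministically.
// `count` must not exceed IMAGES.length.
function createChallenge(count, random = Math.random) {
  const ids = [];
  while (ids.length < count) {
    const id = Math.floor(random() * IMAGES.length);
    if (!ids.includes(id)) ids.push(id);
  }
  const answer = ids[Math.floor(random() * ids.length)];
  return { ids, answer, label: IMAGES[answer] };
}

// Compare the submitted ID with the answer kept on the server side.
// Form values arrive as strings, so both sides are normalized first.
function checkAnswer(session, submittedId) {
  const ok = session.answer !== null && String(session.answer) === String(submittedId);
  session.answer = null; // one attempt per challenge
  return ok;
}

const challenge = createChallenge(3);
const session = { answer: challenge.answer }; // stand-in for req.session
console.log(`Please select the ${challenge.label}`, challenge.ids);
console.log(checkAnswer(session, String(challenge.answer))); // true
console.log(checkAnswer(session, String(challenge.answer))); // false: already used
```

Comparing against null rather than testing truthiness matters here, because image ID 0 is a valid answer.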

Result


Next I will show some of the implementation code. To see the effect in action, you can test it on Pengtouba (http://www.pengtouba.com/). The home page has no anti-crawler strategy. Open the square page http://www.pengtouba.com/weixin/cast-c1p1s1.html and refresh repeatedly with F5 to see the effect.


Code

Intercepting requests (other languages can do something similar, for example with interceptors in Java)

    app.get('/weixin/*', anticrawler.opendoor); // directory to be protected
    app.get('/helper/close-door.js', anticrawler.closedoor); // pseudo JS file


anticrawler.js

    /**
     * anti crawler
     * created by cherokee on 14-7-13.
     */
    var settings = require('../settings.json');
    var redis = require('redis');
    var cache = require('../lib/cache.js');
    var vcode = require('../lib/vcode.js');

    // Uses the callback-style API of the redis client.
    var ac_redis_cli = redis.createClient(settings['anti_crawler_redis_db']['port'], settings['anti_crawler_redis_db']['host']);
    var ip_record_expire = settings['anti_crawler_redis_db']['ip_record_expire'];
    var ip_lock_expire = settings['anti_crawler_redis_db']['ip_lock_expire'];
    var ip_hair_expire = settings['anti_crawler_redis_db']['ip_hair_expire'];
    var door_threshold = settings['anti_crawler_redis_db']['door_threshold'];
    var hair_threshold = settings['anti_crawler_redis_db']['hair_threshold'];

    ac_redis_cli.on('ready', function () {
        console.log('redis for anti-crawler is ready');
    });
    ac_redis_cli.on('error', function (err) {
        console.error('redis for anti-crawler error ' + err);
    });
    ac_redis_cli.on('end', function () {
        console.error('redis for anti-crawler closed');
    });
    ac_redis_cli.select(settings['anti_crawler_redis_db']['db'], function (err) {
        if (err) throw err;
        else {
            cache.set('ac_redis_cli', ac_redis_cli, 77760000);
            console.log('redis for anti-crawler switch db :' + settings['anti_crawler_redis_db']['db']);
        }
    });

    // Middleware for protected pages: decides whether to serve or block.
    exports.opendoor = function (req, res, next) {
        var ua = req.get('user-agent');
        var ip = req.ip;
        var url = req.url;
        if (/\/weixin\//.test(url)) {
            ac_redis_cli.exists('lock:' + ip, function (err, bol) {
                if (bol) {
                    // IP is already locked: show the verification code again
                    send421(req, res);
                } else {
                    ac_redis_cli.get('door:' + ip, function (err, d_num) {
                        if (parseInt(d_num, 10) > door_threshold) { // someone didn't use a browser
                            if (isTrustSpider(ua, ip)) { // it's a trusted spider
                                kickdoor(ip, function (val) {
                                    leavehair(ip, function (val) {
                                        next();
                                    });
                                });
                            } else {
                                blockit(req, res);
                            }
                        } else { // perhaps using a simulated browser to crawl pages
                            ac_redis_cli.get('hair:' + ip, function (err, h_num) {
                                if (parseInt(h_num, 10) > hair_threshold) { // suspicious
                                    blockit(req, res);
                                } else {
                                    kickdoor(ip, function (val) {
                                        leavehair(ip, function (val) {
                                            next();
                                        });
                                    });
                                }
                            });
                        }
                    });
                }
            });
        } else {
            // not a protected path: pass the request through
            next();
        }
    };

    // Handler for the pseudo JS file: decrements the page counter.
    exports.closedoor = function (req, res) {
        ac_redis_cli.multi()
            .decr('door:' + req.ip)
            .expire('door:' + req.ip, ip_record_expire)
            .exec(function (err, replies) {
                if (replies && parseInt(replies[0], 10) < 0) {
                    // never let the counter go negative
                    ac_redis_cli.set('door:' + req.ip, 0, function (err) {
                        res.set('Content-Type', 'application/x-javascript');
                        res.status(200).send('{"zeroize": true}');
                    });
                } else {
                    res.set('Content-Type', 'application/x-javascript');
                    res.status(200).send('{"zeroize": false}');
                }
            });
    };

    // Handler for the verification form.
    exports.verify = function (req, res) {
        var submitted = req.body.vcode;
        var origin_url = req.body.origin_url;
        // Compare against null explicitly: a valid answer index can be 0.
        if (req.session.vcode != null && String(submitted) === String(req.session.vcode)) {
            req.session.vcode = null;
            ac_redis_cli.multi()
                .del('lock:' + req.ip)
                .del('door:' + req.ip)
                .del('hair:' + req.ip)
                .exec(function (err, replies) {
                    res.redirect(origin_url);
                });
        } else send421(req, res, origin_url);
    };

    // Lock the IP for a while and show the verification code.
    var blockit = function (req, res) {
        ac_redis_cli.multi()
            .set('lock:' + req.ip, 1)
            .expire('lock:' + req.ip, ip_lock_expire)
            .exec(function (err, replies) {
                send421(req, res);
            });
    };

    // Pick 3 distinct pictures, store the answer in the session, render the page.
    var send421 = function (req, res, origin_url) {
        var code_map = {};
        var code_arr = [];
        while (code_arr.length < 3) {
            // Math.floor keeps the index within 0..length-1
            var rindex = Math.floor(Math.random() * vcode.list.length);
            if (!code_map[rindex]) {
                code_map[rindex] = true;
                code_arr.push(rindex);
            }
        }
        var answer = code_arr[Math.floor(Math.random() * 3)];
        req.session.vcode = answer;
        res.status(421).render('weixin/421', {
            'code_list': code_arr,
            'code_label': vcode.list[answer],
            'origin_url': origin_url || req.url
        });
    };

    // User-Agent check only; the ip argument is not used here.
    var isTrustSpider = function (ua, ip) {
        // No g flag: a global regex keeps state between test() calls.
        var trustBots = [
            /Baiduspider/i,
            /Googlebot/i,
            /Slurp/i,
            /Yahoo/i,
            /iaskspider/i,
            /sogou/i,
            /yodaobot/i,
            /msnbot/i,
            /360spider/i
        ];
        for (var i = 0; i < trustBots.length; i++) {
            if (trustBots[i].test(ua)) return true;
        }
        return false;
    };

    // Increment the page counter ("door") for this IP.
    var kickdoor = function (ip, callback) {
        ac_redis_cli.multi()
            .incr('door:' + ip)
            .expire('door:' + ip, ip_record_expire)
            .exec(function (err, replies) {
                if (callback) callback(replies ? replies[0] : null);
            });
    };

    // Increment the time-slice counter ("hair") for this IP.
    var leavehair = function (ip, callback) {
        ac_redis_cli.multi()
            .incr('hair:' + ip)
            .expire('hair:' + ip, ip_hair_expire)
            .exec(function (err, replies) {
                if (callback) callback(replies ? replies[0] : null);
            });
    };

In a real deployment you should check the IP whitelist as well as the User-Agent. The code above does not include the IP whitelist check.

The send421 function displays the verification code, and the verify function checks the user's answer.

