First, take a look at the list of spiders:
Search engine | User-agent (contains) | PTR lookup supported | Note
Google | Googlebot | √ | Reverse lookup of the host IP returns a hostname under googlebot.com
Baidu | Baiduspider | √ | Reverse lookup of the host IP returns a hostname matching *.baidu.com or *.baidu.jp
Yahoo | Yahoo! | √ | Reverse lookup of the host IP returns a hostname under inktomisearch.com
Sogou | Sogou | X | Sogou web spider/3.0(+http://www.sogou.com/docs/help/webmasters.htm#07) or Sogou Push Spider/3.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
Netease | YodaoBot | X | Mozilla/5.0 (compatible; YodaoBot/1.0; http://www.yodao.com/help/webmaster/spider/; )
MSN | MSNBot | √ | Reverse lookup of the host IP returns a hostname under live.com
360 | 360Spider | X | Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.0.11) Firefox/1.5.0.11; 360Spider
Soso | Sosospider | X | Sosospider+(+http://help.soso.com/webspider.htm)
Bing | Bingbot | √ | Reverse lookup of the host IP returns a hostname under msn.com
Now look at an example that detects spiders from the User-Agent string:
<?php
// Detect a search engine spider (crawler) from the User-Agent string
function checkrobot($useragent = '') {
    // Keywords are lowercase because the User-Agent is lowercased before matching
    static $kw_spiders = array('bot', 'crawl', 'spider', 'slurp', 'sohu-search', 'lycos', 'robozilla');
    static $kw_browsers = array('msie', 'netscape', 'opera', 'konqueror', 'mozilla');
    $useragent = strtolower(empty($useragent) ? ($_SERVER['HTTP_USER_AGENT'] ?? '') : $useragent);
    // A browser keyword with no URL in the string is treated as a regular browser
    if (strpos($useragent, 'http://') === false && dstrpos($useragent, $kw_browsers))
        return false;
    if (dstrpos($useragent, $kw_spiders))
        return true;
    return false;
}
// Return true (or the matched keyword if $returnvalue is set) when $string contains any item of $arr
function dstrpos($string, $arr, $returnvalue = false) {
    if (empty($string))
        return false;
    foreach ((array) $arr as $v) {
        if (strpos($string, $v) !== false) {
            $return = $returnvalue ? $v : true;
            return $return;
        }
    }
    return false;
}
if (checkrobot()) {
    echo 'Spider';
} else {
    echo 'Human';
}
?>
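The keyword check above only answers "spider or not". To find out which spider is visiting, the User-agent tokens from the spider list can be matched directly. The following is a minimal sketch; the function name `identify_spider` and the sample User-Agent string are illustrative assumptions, not part of the original code.

```php
<?php
// Hypothetical helper: map a User-Agent string to a search engine name,
// using the "User-agent (contains)" tokens from the spider list.
function identify_spider(string $useragent)
{
    // Token (matched case-insensitively) => search engine name.
    static $tokens = array(
        'Googlebot'   => 'Google',
        'Baiduspider' => 'Baidu',
        'Yahoo!'      => 'Yahoo',
        'Sogou'       => 'Sogou',
        'YodaoBot'    => 'Netease',
        'MSNBot'      => 'MSN',
        '360Spider'   => '360',
        'Sosospider'  => 'Soso',
        'Bingbot'     => 'Bing',
    );
    foreach ($tokens as $token => $engine) {
        if (stripos($useragent, $token) !== false) {
            return $engine;
        }
    }
    return false; // no known spider token found
}

// Sample User-Agent string, used here only for illustration.
var_dump(identify_spider('Sosospider+(+http://help.soso.com/webspider.htm)')); // string(4) "Soso"
```

A User-Agent match proves nothing by itself, since the header is trivially forged; it only tells you which spider the client claims to be.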
Example
Verifying a spider with a reverse DNS (PTR) lookup in PHP
<?php
/**
 * Check whether an IP address really belongs to the spider named in the User-Agent
 * Example: check_spider('66.249.74.44', $_SERVER['HTTP_USER_AGENT']);
 * @copyright http://blog.chacuo.net
 * @author 8292669
 * @param string $ip IP address
 * @param string $ua User-Agent string
 * @return false|string Spider name, or false if detection fails or the spider is not in the list
 */
function check_spider($ip, $ua)
{
    // Spider name => array(User-Agent keyword, domain expected in the PTR hostname)
    static $spider_list = array(
        'Google' => array('Googlebot', 'googlebot.com'),
        'Baidu' => array('Baiduspider', '.baidu.'),
        'Yahoo' => array('Yahoo!', 'inktomisearch.com'),
        'MSN' => array('MSNBot', 'live.com'),
        'Bing' => array('Bingbot', 'msn.com')
    );
    // Only accept dotted-quad IPv4 addresses
    if (!preg_match('/^(\d{1,3}\.){3}\d{1,3}$/', $ip)) return false;
    if (empty($ua)) return false;
    foreach ($spider_list as $k => $v)
    {
        // The User-Agent claims to be this spider
        if (stripos($ua, $v[0]) !== false)
        {
            // Reverse lookup: resolve the IP to its hostname
            $domain = gethostbyaddr($ip);
            if ($domain && stripos($domain, $v[1]) !== false)
            {
                return $k;
            }
        }
    }
    return false;
}
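The check above accepts any hostname that merely contains the expected domain, so a PTR record such as googlebot.com.example.net would pass, and whoever controls the reverse zone of an IP can set its PTR record freely. A stricter variant compares the end of the hostname and then resolves that hostname forward to confirm it maps back to the same IP. The sketch below assumes IPv4 only; `verify_spider_host` and its injectable resolver parameters are hypothetical names introduced here so the logic can be exercised without network access, and the stub hostname is sample data.

```php
<?php
// Sketch: forward-confirmed reverse DNS check (IPv4 only).
// $reverse and $forward default to PHP's own resolvers and can be
// replaced by stubs for testing.
function verify_spider_host(string $ip, string $suffix, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';   // IP -> hostname (PTR)
    $forward = $forward ?? 'gethostbynamel';  // hostname -> list of IPv4 addresses

    $host = $reverse($ip);
    // gethostbyaddr() returns the unmodified IP on failure, false on malformed input.
    if (!$host || $host === $ip) {
        return false;
    }
    $host = strtolower(rtrim($host, '.'));
    $suffix = strtolower(ltrim($suffix, '.'));
    // The hostname must be the domain itself or end with ".<domain>".
    if ($host !== $suffix && substr($host, -strlen('.' . $suffix)) !== '.' . $suffix) {
        return false;
    }
    // Forward-confirm: the hostname must resolve back to the same IP.
    $ips = $forward($host);
    return is_array($ips) && in_array($ip, $ips, true);
}

// Usage with stub resolvers (sample data, no DNS traffic):
$stubReverse = function ($ip) { return 'crawl-66-249-74-44.googlebot.com'; };
$stubForward = function ($host) { return array('66.249.74.44'); };
var_dump(verify_spider_host('66.249.74.44', 'googlebot.com', $stubReverse, $stubForward)); // bool(true)
```

For Baidu, which uses both *.baidu.com and *.baidu.jp, the function would simply be called once per suffix.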
For now this covers only a few search engines: the ones whose spiders can be verified with a reverse DNS lookup. For spiders that cannot be verified this way, it is best to apply rate limiting, because users can forge a search engine's User-Agent to crawl your resources.
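As a rough illustration of the rate limiting mentioned above, here is a minimal fixed-window limiter keyed by client IP. It is a sketch under stated assumptions: the class name `SimpleRateLimiter` is hypothetical, and the counters live in a plain PHP array, so they last only for the lifetime of one process. A real site would need to keep them in storage shared across requests.

```php
<?php
// Sketch of a fixed-window rate limiter keyed by client IP.
// Assumption: counters are kept in memory for demonstration only.
class SimpleRateLimiter
{
    private $limit;          // max requests per window
    private $window;         // window length in seconds
    private $hits = array(); // ip => array(window start, request count)

    public function __construct(int $limit, int $windowSeconds)
    {
        $this->limit = $limit;
        $this->window = $windowSeconds;
    }

    // Returns true if the request is allowed, false if the IP is over its limit.
    // $now can be passed explicitly to make the behaviour testable.
    public function allow(string $ip, ?int $now = null): bool
    {
        $now = $now ?? time();
        if (!isset($this->hits[$ip]) || $now - $this->hits[$ip][0] >= $this->window) {
            $this->hits[$ip] = array($now, 0); // start a new window
        }
        if ($this->hits[$ip][1] >= $this->limit) {
            return false;
        }
        $this->hits[$ip][1]++;
        return true;
    }
}

// Allow at most 2 requests per 60 seconds from an unverified "spider".
$limiter = new SimpleRateLimiter(2, 60);
var_dump($limiter->allow('203.0.113.7', 1000)); // bool(true)
var_dump($limiter->allow('203.0.113.7', 1001)); // bool(true)
var_dump($limiter->allow('203.0.113.7', 1002)); // bool(false)
```

One possible arrangement is to skip the limiter for clients that pass the reverse lookup check and apply it to every other client whose User-Agent claims to be a spider.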