How to accurately determine whether a request was sent by a search engine crawler (spider)
Websites are often visited by various crawlers; some are search engine crawlers and some are not. These crawlers generally send a User-Agent, but a User-Agent can be forged: it is just a field in the HTTP request header, and a program can set it to any value.
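As a minimal sketch of how easily this header can be forged, the following uses Python's standard `urllib.request` to build (without sending) a request that claims to be Googlebot. The URL, the User-Agent string, and the helper name `build_spoofed_request` are illustrative assumptions, not values from this article.

```python
import urllib.request

# Illustrative User-Agent string; any client can send whatever value it likes.
FAKE_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def build_spoofed_request(url):
    """Build (but do not send) a request whose User-Agent claims to be Googlebot."""
    return urllib.request.Request(url, headers={"User-Agent": FAKE_UA})

req = build_spoofed_request("http://example.com/")
# urllib stores header names capitalized, so the key is "User-agent".
print(req.get_header("User-agent"))
```

A server that receives such a request sees only the claimed User-Agent and has no way to tell from the header alone that it is not genuine.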
Therefore, using the User-Agent to decide whether a request comes from a search engine crawler (spider) is unreliable. A more reliable method is to check whether the host name that the requester's IP address resolves to belongs to the search engine itself.
To look up the host name for an IP address, use the nslookup command on Windows or the host command on Linux.
For example, I ran nslookup for an IP address on Windows, and the host name returned was crawl-66-249-64-119.googlebot.com. This indicates that the IP address belongs to a Google crawler, because the host names of Google crawlers have the form xxx.googlebot.com.
We can also obtain the host name for an IP address with a Python program. The code is as follows:
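    import socket

    def getHost(ip):
        """Return (host_name, None) on success or (None, error_message) on failure."""
        try:
            # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
            result = socket.gethostbyaddr(ip)
            if result:
                return result[0], None
        except (socket.herror, socket.gaierror) as e:
            # Raised when the address has no reverse DNS entry or is malformed
            return None, str(e)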
The code above uses the gethostbyaddr function of the socket module to obtain the host name for the IP address.
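A reverse lookup on its own trusts whoever controls the reverse DNS record for the IP address, so a common extra step (not covered in the text above) is to resolve the returned host name forward again and check that it maps back to the same IP. A hedged sketch follows; `forward_confirmed_host` and its `reverse` and `forward` parameters are hypothetical names introduced here for illustration, and the forward lookup shown handles IPv4 only.

```python
import socket

def forward_confirmed_host(ip, reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Return the host name for ip only if it resolves back to ip, else None.

    reverse/forward default to the socket module's resolvers; they are
    parameters only so the logic can be exercised without network access.
    """
    try:
        host = reverse(ip)[0]
        # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist), IPv4 only.
        addresses = forward(host)[2]
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return None
    return host if ip in addresses else None
```

The returned host name still has to be compared against the search engine's own domains before the requester is treated as a genuine spider.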
The host names of common spiders fall under domains related to the search engine's official website, for example:
- Baidu's spider is generally a subdomain of baidu.com or baidu.jp.
- Google's crawlers are generally subdomains of googlebot.com.
- Microsoft's Bing crawler is a subdomain of search.msn.com.
- Sogou's spider is a subdomain of crawl.sogou.com.
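The domain check described above can be sketched as a suffix comparison. This is a minimal illustration, not a complete list of crawler domains; `SPIDER_DOMAINS` and `is_spider_host` are names introduced here. Matching on "." plus the domain, rather than on the bare domain, avoids accepting look-alike hosts.

```python
# Domain suffixes taken from the list above; extend as needed.
SPIDER_DOMAINS = ("baidu.com", "baidu.jp", "googlebot.com",
                  "search.msn.com", "crawl.sogou.com")

def is_spider_host(host):
    """Return True if host is one of SPIDER_DOMAINS or a subdomain of one."""
    host = host.lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in SPIDER_DOMAINS)

print(is_spider_host("crawl-66-249-64-119.googlebot.com"))  # True
print(is_spider_host("googlebot.com.example.org"))          # False
print(is_spider_host("fakegooglebot.com"))                  # False
```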
Based on the principles above, I wrote a tool page that determines whether an IP address belongs to a real search engine crawler. The page also lists the IP addresses of common Google and Bing crawlers.
Page address: http://outofmemory.cn/tools/is-search-engine-spider-ip/
The code in this article is written in Python, but the same approach can be implemented in C#; the principle is the same.
IP ranges of common search engine spiders:
| Spider name | IP address |
| --- | --- |
| Baiduspider | 202.108.11.*, 220.181.32.*, 58.51.95.*, 60.28.22.*, 61.135.162.*, 61.135.163.*, 61.135.168.* |
| YodaoBot | 202.108.7.215, 202.108.7.220, 202.108.7.221 |
| Sogou web spider | 219.234.81.*, 220.181.61.* |
| Googlebot | 203.208.60.* |
| Yahoo! Slurp | 2018.0.181.*, 72.30.215.*, 74.6.17.*, 74.6.22.* |
| Yahoo ContentMatch Crawler | 119.42.226.*, 119.42.230.* |
| Sogou-Test-Spider | 220.181.19.103, 220.181.26.122 |
| Twiceler | 38.99.44.104, 64.34.251.9 |
| Yahoo! Slurp China | 2018.0.178.* |
| Sosospider | 124.115.0.* |
| CollapsarWEB qihoobot | 221.194.136.18 |
| NaverBot | 202.179.180.45 |
| Sogou Orion spider | 220.181.19.106, 220.181.19.74 |
| Sogou head spider | 220.181.19.107 |
| SurveyBot | 216.145.5.42, 64.246.165.160 |
| Yanga WorldSearch Bot v | 77.91.224.19, 91.205.124.19 |
| Baiduspider-mobile-gate | 220.181.5.34, 61.135.166.31 |
| Discobot | 208.96.54.70 |
| Ia_archiver | 209.234.171.42 |
| Msnbot | 65.55.104.209, 65.55.209.86, 65.55.209.96 |
| Sogou in spider | 220.181.19.216 |
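A list like the one above can be checked programmatically. The sketch below uses Python's standard `ipaddress` module and treats a trailing `.*` as a /24 network; only a few entries are copied in, and `SPIDER_IP_PATTERNS`, `pattern_to_network`, and `spider_for_ip` are names introduced here for illustration. Published IP lists go out of date, so this is best treated as a supplement to the host name check rather than a replacement for it.

```python
import ipaddress

# A few entries copied from the list above; add the rest as needed.
SPIDER_IP_PATTERNS = {
    "Googlebot": ["203.208.60.*"],
    "Sosospider": ["124.115.0.*"],
    "Msnbot": ["65.55.104.209", "65.55.209.86", "65.55.209.96"],
}

def pattern_to_network(pattern):
    """Convert 'a.b.c.*' to a /24 network and a plain address to a /32."""
    if pattern.endswith(".*"):
        return ipaddress.ip_network(pattern[:-2] + ".0/24")
    return ipaddress.ip_network(pattern + "/32")

def spider_for_ip(ip):
    """Return the spider name whose listed range contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    for name, patterns in SPIDER_IP_PATTERNS.items():
        if any(addr in pattern_to_network(p) for p in patterns):
            return name
    return None

print(spider_for_ip("203.208.60.17"))  # Googlebot
print(spider_for_ip("65.55.209.86"))   # Msnbot
print(spider_for_ip("8.8.8.8"))        # None
```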