Simulate a crawl:
curl -I -A 'Baiduspider' hello.net
The response:
HTTP/1.1 200 OK
Server: nginx
Date: Wed, ... 07:26:48 GMT
A 200 response like the above means the server currently allows this crawler.
If the response is HTTP/1.1 403 Forbidden, the crawler is being blocked.
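Once one of the methods below is in place, the same test can be repeated to confirm the rule works; a quick sketch, using the article's placeholder domain hello.net:

# Normal browser UA: expect HTTP/1.1 200 OK
curl -I -A 'Mozilla/5.0' hello.net
# Blocked crawler UA: expect HTTP/1.1 403 Forbidden
curl -I -A 'Baiduspider' hello.net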
----------------------------------------------------------------------------------------
Method 1: block by User-Agent
Add the following to the server block; list multiple user agents separated by a pipe (|). The ~* operator makes the match case-insensitive, so Baiduspider, baiduspider, and BaiduSpider are all caught:
server {
    if ($http_user_agent ~* "qihoobot|Baiduspider|Googlebot") {
        return 403;
    }
}
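After editing, check and reload the configuration so the rule takes effect. A sketch, assuming Nginx is installed under /usr/local/nginx as in Method 3 below:

/usr/local/nginx/sbin/nginx -t        # test the configuration syntax
/usr/local/nginx/sbin/nginx -s reload # reload without dropping connections
curl -I -A 'Baiduspider' hello.net    # should now return 403 Forbidden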
To refuse wget by its User-Agent, add the following:
## Block HTTP user agent: wget ##
if ($http_user_agent ~* (Wget)) {
    return 403;
}
## Block software download user agents ##
if ($http_user_agent ~* LWP::Simple|BBBike|wget) {
    return 403;
}
Method 2: robots.txt
Use a robots.txt file. For example, to ask all crawlers to stay away from the entire site (this relies on crawlers honoring the file, so the effect is limited):
User-agent: *
Disallow: /
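robots.txt can also target a single crawler rather than all of them; for example, a variant that asks only Baiduspider to stay away while allowing everyone else:

User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow: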
Method 3: a separate configuration file
Go to the conf directory under the Nginx installation directory and save the following code as agent_deny.conf:
cd /usr/local/nginx/conf
vim agent_deny.conf
# Block crawling by tools such as Scrapy
if ($http_user_agent ~* (Scrapy|curl|HttpClient)) {
    return 403;
}
# Block the listed user agents as well as requests with an empty user agent
if ($http_user_agent ~ "FeedDemon|JikeSpider|^$") {
    return 403;
}
# Block request methods other than GET, HEAD, and POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
Then, in the site's configuration, insert the following line right after location / {:
include agent_deny.conf;
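For context, a minimal sketch of where the include lands in a site configuration; the listen port, server_name, and paths are illustrative assumptions, not from the original setup:

server {
    listen 80;
    server_name hello.net;

    location / {
        include agent_deny.conf;  # rejects bad user agents with 403
        root /usr/local/nginx/html;
        index index.html;
    }
}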
In the end, Method 1 is the recommended way to block crawlers in Nginx.