Understand the usage of robots.txt to optimize search engine crawling and indexing

Optimizing robots.txt has a clear effect on Google and Baidu SEO, and the same is true for WordPress blogs.

Let's take a look at what robots.txt is and what it does.

What is robots.txt?

We all know that a file with the .txt suffix is a plain text document, and "robots" refers to search engine robots, so, as the name suggests, robots.txt is a plain text file written for search engine spiders. robots.txt is a well-documented convention that search engines generally follow. It tells search engines such as Google and Baidu which pages they are allowed to crawl, index, and show in search results, and which pages are off-limits. When a search engine spider (Googlebot / Baiduspider) visits your website, it first checks whether a robots.txt file exists in the site's root directory; if it does, the spider crawls and indexes your pages according to the rules defined there. For example, Taobao blocks the Baidu search engine with the following robots.txt:

User-agent: Baiduspider
Disallow: /
User-agent: baiduspider
Disallow: /

The role of robots.txt

Now that we know what robots.txt is, what does it actually do? Overall, the robots.txt file serves at least two purposes:

1. By blocking search engines from pages that do not need to be included in the index, you can greatly reduce the bandwidth consumed by spiders crawling the site. The effect is barely noticeable on small sites but significant on large ones.

2. robots.txt lets you tell Google, Baidu, and other search engines which URLs not to index. For example, after URL rewriting turns dynamic URLs into static permalinks, you can use robots.txt to stop Google, Baidu, and other search engines from indexing the dynamic URLs, which greatly reduces the number of duplicate pages on the site and plays a significant role in SEO, as shown in the sketch after this list.
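
As a concrete illustration of the two points above, here is a minimal sketch of a WordPress-style robots.txt. The directory names (/wp-admin/, /wp-includes/) and the dynamic-URL pattern are only illustrative assumptions, not a recommended production configuration:

User-agent: *
# Block back-end directories that do not need to be indexed (saves crawl bandwidth)
Disallow: /wp-admin/
Disallow: /wp-includes/
# Block dynamic URLs containing "?" so only the rewritten permalinks are indexed
Disallow: /*?*

Note that the "*" wildcard and "$" anchor used below are extensions supported by major spiders such as Googlebot and Baiduspider, but not necessarily by every crawler.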

How to write robots.txt

How do you write a robots.txt file? Below we use a WordPress blog to give more specific examples, but first, here are a few points to keep in mind when writing robots.txt. A robots.txt file that allows everything looks like this:

User-agent: *
Disallow:
Allow: /

robots.txt must be uploaded to the root directory of your site; it has no effect if placed in a subdirectory;

Pay attention to capitalization in the file name robots.txt and in directives such as Disallow; do not change the case;

The colon after User-agent, Disallow, and so on must be a half-width (English) colon, and it may be followed by a space or by no space. Some people online claim there must be a space after the colon, but that is not the case; see, for example, the settings on the Google Chinese Webmaster Blog: http://www.googlechinawebmaster.com/robots.txt;

User-agent specifies the search engine spider: the asterisk "*" stands for all spiders, Google's spider is "Googlebot", and Baidu's is "Baiduspider";

Disallow: specifies directories that search engines are not allowed to access and index;

Allow: specifies directories the spider is allowed to access and index; "Allow: /" allows everything and is equivalent to an empty "Disallow:".

Examples of robots.txt files

Block all search engines, such as Google and Baidu, from accessing the entire website

User-agent: *
Disallow: /

Allow all search engine spiders to access the entire site (an empty "Disallow:" is equivalent to "Allow: /")

User-agent: *
Disallow:

Block Baiduspider from visiting your website, while other search engines such as Google are not blocked

User-agent: Baiduspider
Disallow: /

Allow only Google's spider (Googlebot) to visit your site, and block other search engines such as Baidu

User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /

Block search engine spiders from visiting specified directories (spiders will not visit these directories; each directory must be declared on its own line and cannot be combined into one rule)

User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
Disallow: /~jjjj/

Block search engine spiders from a specified directory, but allow access to a particular subdirectory of that directory

User-agent: *
Allow: /admin/far
Disallow: /admin/

Use the wildcard asterisk "*" to block URLs by pattern
(all search engines are blocked from crawling any ".html" page in the /cgi-bin/ directory, including its subdirectories)

User-agent: *
Disallow: /cgi-bin/*.html

Use the dollar sign "$" to restrict access by file suffix (only ".html" web pages may be accessed)

User-agent: *
Allow: /*.html$
Disallow: /

Block Google, Baidu, and all other search engines from visiting dynamic URL pages containing "?"

User-agent: *
Disallow: /*?*

Prevent Google's spider (Googlebot) from accessing images of a certain format (here, access to .jpg images is blocked)

User-agent: Googlebot
Disallow: /*.jpg$

Allow Google's spider (Googlebot) to crawl only web pages and .gif images (Googlebot may crawl web pages and gif-format images, while other image formats are blocked;
no rules are set for other search engines)

User-agent: Googlebot
Allow: /*.gif$
Disallow: /*.jpg$
.......

Block only Google's spider (Googlebot) from fetching .jpg images
(other search engines and other image formats are not blocked)

User-agent: Googlebot
Disallow: /*.jpg$

For more details, see Google's and Baidu's own documentation on robots.txt.
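
After writing your rules, it is worth checking that they behave as intended. Below is a minimal sketch using Python's standard urllib.robotparser module to test whether a given spider may fetch a URL; the domain, paths, and rules are placeholders, and note that this standard-library parser only understands plain path prefixes, not the "*" and "$" extensions used in some of the examples above.

from urllib.robotparser import RobotFileParser

# Hypothetical rules to test (plain path prefixes only; urllib.robotparser
# does not implement the "*" and "$" wildcard extensions)
rules = """\
User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# can_fetch(useragent, url) returns True if the URL may be crawled
print(rp.can_fetch("Googlebot", "https://www.example.com/about.html"))        # True
print(rp.can_fetch("Googlebot", "https://www.example.com/admin/index.html"))  # False
print(rp.can_fetch("Baiduspider", "https://www.example.com/about.html"))      # False

You can also point the parser at a live file by calling set_url("https://www.example.com/robots.txt") followed by read() instead of parse().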
