How to write the syntax for robots.txt

Use a robots.txt file to block or remove web pages

A robots.txt file restricts access to your site by search engine robots that crawl the web. These robots are automated, and before they access any page of a site they check whether a robots.txt file exists that prevents them from accessing that page. (All respectable robots respect the directives in a robots.txt file, although some may interpret them differently. robots.txt is not enforceable, however, and some spammers and other troublemakers may ignore it. For this reason, we recommend password-protecting confidential information.)

You need a robots.txt file only if your site includes content that you do not want search engines to index. If you want search engines to index everything on your site, you do not need a robots.txt file, not even an empty one.

Although Google will not crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page, and potentially other publicly available information such as the anchor text in links to the site or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.

To use a robots.txt file, you need access to the root of your domain (if you are not sure whether you have that access, check with your web administrator). If you do not have access to the root of the domain, you can restrict access using the robots meta tag instead.

To entirely prevent a page's contents from being listed in the Google web index, even if other sites link to it, use a noindex meta tag. As long as Googlebot fetches the page, it will see the noindex meta tag and keep the page out of the web index.
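
The tag goes in the <head> section of the page you want to keep out of the index. A minimal illustration:

<meta name="robots" content="noindex">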

Use the robots.txt generation tool to create a robots.txt file

    1. On the Webmaster Tools home page, click the site you want.
    2. Under Site configuration, click Crawler access.
    3. Click the Generate robots.txt tab.
    4. Choose your default robot access. We recommend that you allow all robots, and use the steps below to exclude any specific robots that you don't want accessing your site. This helps prevent accidentally blocking important crawlers from your site.
    5. Specify any additional rules. For example, to block Googlebot from all files and directories on your site (the resulting entry is shown after this list):
      1. In the Action list, select Disallow.
      2. In the Robot list, click Googlebot.
      3. In the Files or directories box, type /.
      4. Click Add. The code for your robots.txt file will be generated automatically.
    6. Save your robots.txt file by downloading the file, or by copying the contents to a text file and saving it as robots.txt. Save the file to the highest-level directory of your site. The robots.txt file must reside in the root of the domain and must be named "robots.txt". A robots.txt file located in a subdirectory is not valid, because robots check for this file only in the root of the domain. For instance, http://www.example.com/robots.txt is a valid location, but http://www.example.com/mysite/robots.txt is not.
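
Following the example in step 5, the generated file would contain an entry along these lines:

User-agent: Googlebot
Disallow: /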

The rules specified in a robots.txt file are requests, not enforceable commands. Googlebot and all other respectable robots will follow the instructions in a robots.txt file, but some rogue robots, such as those used by spammers and scrapers of illicit content, may not. We therefore recommend that you keep confidential information in a password-protected directory on your server. Also, different robots may interpret a robots.txt file differently, and not all robots support every directive in the file. Although we make our best effort to create robots.txt files that work for all robots, we cannot guarantee how those files will be interpreted.

To check that your robots.txt file behaves as expected, use the Test robots.txt tool in Webmaster Tools.

Manually create a robots.txt file

The simplest robots.txt file uses two rules:

    • User-agent: the robot that the following rules apply to
    • Disallow: the URL you want to block

These two lines together are considered a single entry in the file. You can include as many entries as you need, and a single entry can contain multiple Disallow lines and multiple user-agents.
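
As a minimal sketch (the directory names here are placeholders), a single entry that applies to two robots and blocks two directories could look like this:

User-agent: Googlebot
User-agent: Googlebot-Image
Disallow: /private/
Disallow: /tmp/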

Each section of the robots.txt file is separate and does not build on the sections before it. For example:

 
User-agent: *
Disallow: /folder1/

User-agent: Googlebot
Disallow: /folder2/

In this example, only the URLs matching /folder2/ would be disallowed for Googlebot.

User-agents and robots

A user-agent is a specific search engine robot. The Web Robots Database lists many common robots. You can set up an entry to apply to a specific robot (by listing its name) or to all robots (by listing an asterisk). An entry that applies to all robots looks like this:

 
User-agent: *

Google uses several different robots (user-agents). The robot we use for our web search is Googlebot. Our other robots, such as Googlebot-Mobile and Googlebot-Image, follow the rules you set up for Googlebot, but you can also set up specific rules for these particular robots.
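
As a sketch (the directory name is a placeholder), giving Googlebot-Image its own entry overrides the general Googlebot rules for that robot only:

User-agent: Googlebot
Disallow: /archives/

User-agent: Googlebot-Image
Disallow: /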

Blocking user-agents

The Disallow line lists the pages you want to block. You can list a specific URL or a pattern. The entry should begin with a forward slash (/).

    • To block the entire site, use a forward slash.
       Disallow: /
    • To block a directory and everything in it, follow the directory name with a forward slash.
       Disallow: /useless_directory/
    • To block a page, list the page.
       Disallow: /private_file.html
    • To remove a specific image from Google Images, add the following:
       User-agent: Googlebot-Image
       Disallow: /images/dog.jpg
    • To remove all images on your site from Google Images, use the following:
       User-agent: Googlebot-Image
       Disallow: /
    • To block files of a specific file type (for example, .gif), use the following:
       User-agent: Googlebot
       Disallow: /*.gif$
    • To block crawling of the pages on your site while still displaying AdSense ads on those pages, disallow all robots other than Mediapartners-Google. This keeps the pages out of search results, but allows the Mediapartners-Google robot to analyze the pages and determine which ads to show. The Mediapartners-Google robot does not share pages with the other Google user-agents. For example:
       User-agent: *
       Disallow: /

       User-agent: Mediapartners-Google
       Allow: /

Note that directives are case-sensitive. For instance, Disallow: /junk_file.asp would block http://www.example.com/junk_file.asp but not http://www.example.com/Junk_file.asp. Googlebot ignores white space (in particular, empty lines) and unknown directives in robots.txt.

Googlebot supports submission of Sitemap files through the robots.txt file.
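
The Sitemap directive takes the full URL of the Sitemap file; for example, using the placeholder domain from the examples above:

Sitemap: http://www.example.com/sitemap.xml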

Pattern Matching

Googlebot (but not all search engines) follows certain pattern matching principles.

    • To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with "private", use the following:
      User-agent: Googlebot
      Disallow: /private*/
    • To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string), use the following:
      User-agent: Googlebot
      Disallow: /*?
    • To specify matching the end of a URL, use $. For instance, to block any URL that ends with .xls, use the following:
      User-agent: Googlebot
      Disallow: /*.xls$

      You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain one so that Googlebot does not crawl duplicate pages. However, URLs that end with a ? may be the version of the page that you do want included. In this case, you can set up your robots.txt file as follows:

      User-agent: *
      Allow: /*?$
      Disallow: /*?

      The Disallow: /*? directive blocks any URL that includes a ? (more specifically, it blocks any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

      The Allow: /*?$ directive allows any URL that ends in a ? (more specifically, it allows any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).

Test the robots.txt file

The Test robots.txt tool shows you whether your robots.txt file accidentally blocks Googlebot from a file or directory on your site, or whether it permits Googlebot to crawl files that should not appear on the web. When you enter the text of a proposed robots.txt file, the tool reads it the same way Googlebot does, and lists the effects of the file and any problems found.

To test the robots.txt file of a website, follow these steps:

    1. On the Webmaster Tools home page, click the site you want.
    2. Under Site configuration, click Crawler access.
    3. If it is not already selected, click the Test robots.txt tab.
    4. Copy the content of your robots.txt file and paste it into the first box.
    5. In the URLs box, list the site to test against.
    6. In the User-agents list, select the user-agents you want.

Any changes you make in this tool are not saved. To save your changes, copy the contents and paste them into your robots.txt file.

This tool provides results only for Google user-agents (such as Googlebot). Other robots may not interpret the robots.txt file in the same way. For instance, Googlebot supports an extended definition of the standard robots.txt protocol: it understands Allow: directives as well as some pattern matching. So although the tool shows lines that include these extensions as understood, remember that this applies only to Googlebot and not necessarily to the other robots that may crawl your site.

 

http://www.google.com/support/webmasters/bin/answer.py?hl=zh_cn&answer=156449

 
