There is a force that has quietly infiltrated a huge number of sites and pages. We usually never see it at work, and most people do not even know it exists. Don't be alarmed: I am talking about search engine crawlers and robots. Every day, hundreds of these crawlers scurry out to scan sites at high speed, whether it is Google trying to index the entire web or a spam robot harvesting e-mail addresses in bulk; they roam around with seemingly aimless purpose. As a website owner, you can control what robots are allowed to do through a file called robots.txt.
Creating robots.txt Files
Okay, now let's get down to business. Create a text file called robots.txt, and make sure it has exactly that name. The file must be uploaded to the root of your site, not to a subdirectory (for example, it belongs under http://www.mysite.com, not under http://www.mysite.com/stuff). Only when both conditions are met, that is, the filename is correct and the path is correct, will search engines follow the rules in the file; otherwise robots.txt is just an ordinary file and has no effect.
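To make the root-only rule concrete, here is a small Python sketch; the helper name robots_url and the mysite.com URLs are just illustrations, not part of any standard API:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the only place a crawler will look for robots.txt: the site root."""
    scheme, netloc, _path, _query, _frag = urlsplit(page_url)
    return urlunsplit((scheme, netloc, "/robots.txt", "", ""))

# No matter how deep the page is, the crawler checks the root.
print(robots_url("http://www.mysite.com/stuff/page.html"))
# http://www.mysite.com/robots.txt
```

This is exactly why a robots.txt uploaded to a subdirectory is ignored: crawlers never request it from there.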
Now that you know what to name the file and where to upload it, the next step is learning what commands to type into it. Search engines follow a convention called the Robots Exclusion Protocol. Its format is very simple and covers most control needs: first comes a User-agent line that identifies the type of crawler, followed by one or more Disallow lines that restrict the crawler's access to some parts of the site.
1) Basic robots.txt settings
User-agent: *
Disallow: /

According to the statement above, no crawler (the * stands for all of them) is allowed to index any part of your site; the / represents every page. Normally this is not what you want, but it illustrates the concept.
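You can check what a rule actually does without guessing, using Python's standard urllib.robotparser module (the bot name and URL here are made up for the example):

```python
import urllib.robotparser

# The "block everything" rules from above, fed to the stdlib parser.
rules = ["User-agent: *", "Disallow: /"]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# No crawler may fetch any page of the site.
print(rp.can_fetch("AnyBot", "http://www.mysite.com/index.html"))  # False
```

The same two-line pattern is the template for every example that follows; only the user-agent and the paths change.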
2) Now let's make a small change. Every webmaster likes Google, but you may not want Google's image robot, Googlebot-Image, digging through your site and making copies of your images searchable online, if only to save the bandwidth of the server your site runs on. The following statement does exactly that:

User-agent: Googlebot-Image
Disallow: /

3) The following code tells every search engine and robot not to index a specific directory or page:
User-agent: *
Disallow: /tutorials/blank.htm

4) You can also set different rules for multiple robots. Look at the following code:
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/

This setting is interesting: here we forbid all search engines from crawling our site, except Google, which is allowed to access everything apart from /cgi-bin/ and /privatedir/. It shows that rules can be customized per robot, but they are not inherited: Googlebot obeys only the rules in its own section, not those under *.
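A quick sketch with urllib.robotparser (the URLs and the OtherBot name are illustrative) confirms this per-robot behavior, including the fact that Disallow matches by path prefix:

```python
import urllib.robotparser

rules = [
    "User-agent: *",
    "Disallow: /",
    "",
    "User-agent: Googlebot",
    "Disallow: /cgi-bin/",
    "Disallow: /privatedir/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# Googlebot follows only its own section: ordinary pages are open to it...
print(rp.can_fetch("Googlebot", "http://www.mysite.com/index.html"))   # True
# ...but anything under the two listed directories is off limits.
print(rp.can_fetch("Googlebot", "http://www.mysite.com/cgi-bin/run"))  # False
# Every other robot falls back to the * section and is shut out entirely.
print(rp.can_fetch("OtherBot", "http://www.mysite.com/index.html"))    # False
```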
5) There is one more way to use Disallow: to allow access to all of a site's content, simply put nothing after the colon. Combined with per-robot sections, the following file shuts out every crawler except Alexa's robot, which identifies itself as ia_archiver:

User-agent: *
Disallow: /

User-agent: ia_archiver
Disallow:

Here, all crawlers except ia_archiver are barred from searching our site; the empty Disallow line means nothing is off limits to it.
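The empty-Disallow semantics can be verified the same way (again with urllib.robotparser and made-up URLs):

```python
import urllib.robotparser

rules = [
    "User-agent: *",
    "Disallow: /",
    "",
    "User-agent: ia_archiver",
    "Disallow:",          # empty value: nothing is disallowed for this robot
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("ia_archiver", "http://www.mysite.com/page.html"))  # True
print(rp.can_fetch("AnyOtherBot", "http://www.mysite.com/page.html"))  # False
```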
6) Finally, some crawlers now support an Allow rule, Google's being the most famous. As its name suggests, "Allow:" lets you specify exactly which files or folders may be accessed. However, this directive is not part of the original robots.txt protocol, so I recommend using it only when you really must, because some less sophisticated crawlers may treat it as an error.
The following example comes from Google's FAQ for webmasters. If you want your site to be crawled by Google and no one else, it is a good choice.
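The snippet itself is missing from this copy of the text; a reconstruction consistent with Google's long-documented advice, checked with urllib.robotparser (which also understands Allow lines), would be:

```python
import urllib.robotparser

# Reconstructed "Google only" rules: block every robot, then let
# Googlebot back in with an Allow rule. The URL is illustrative.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("Googlebot", "http://www.mysite.com/page.html"))     # True
print(rp.can_fetch("SomeOtherBot", "http://www.mysite.com/page.html"))  # False
```

Note that this relies on crawlers honoring Allow; robots that only implement the original protocol will simply see the * section and stay away, which is the safe failure mode here.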