Bamboo Shadow Breeze: robots.txt, the neglected SEO weapon

Source: Internet
Author: User

I, Bamboo Shadow Breeze, have been building websites for some years now. At the request of some webmaster friends, today I am sharing a little of my experience. The topic is robots.txt. Many webmasters pay little attention to robots.txt, but using it properly brings your site benefits and no harm.

Disclaimer: this article is aimed at beginners; veterans may gracefully drift past.

Topic One: What is robots.txt?

To quote Baidu's answer: robots.txt is a plain text file that must be placed in the root directory of the site, and its file name must be all lowercase: "robots.txt". In this file you declare the parts of the site that you do not want robots to access, so that some or all of the site's content is kept out of search engines, or so that a designated search engine indexes only the content you specify.

Topic Two: How is robots.txt actually used?

Function 1: Tell search spiders where your sitemap is, so that more of your site's pages get indexed.

Google, Yahoo and other foreign search engines now support a sitemap link in the robots.txt file. When a spider visits robots.txt, the link tells it where the sitemap is, which helps the spider index your pages. The syntax is Sitemap: http://www.##.com/sitemap.xml (Google) or Sitemap: http://www.##.com/sitemap.txt (Yahoo). You can generate the sitemap file with sitemap generation software, or write your own program to generate it.
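To check that a Sitemap line is written in a form a parser can read, here is a minimal sketch using Python's standard urllib.robotparser (its site_maps() method needs Python 3.8 or later). The domain www.example.com and the helper name list_sitemaps are placeholders of mine, not part of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file content; www.example.com stands in for your own domain.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
"""


def list_sitemaps(robots_txt):
    """Return the sitemap URLs declared in a robots.txt document."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when the file has no Sitemap line.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(list_sitemaps(ROBOTS_TXT))
```

If the function returns an empty list for your own file, the Sitemap line is probably missing or misspelled.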

Function 2: Forbid all search spiders from crawling all of your site or specified directories. Several situations come up in practice:

The first is to prohibit all search spiders from crawling any content on your site.

Suppose the site has just been uploaded to the server or virtual host for debugging. The page titles, keywords and so on have not been optimized yet, but the site already has external links and you do not want search engines to index it. In that case you can prohibit all search engines from indexing any of your pages.

Here is a cautionary example of my own. In 2006 I built a website with the Dream Weaver content management system (DedeCMS). I applied the first template I found, added some content, and in my excitement submitted the site to the search engines. It was indexed the next day, and over the following days I published hundreds of pages. Then I found a more attractive, cleaner template, switched to it and regenerated every page, and I repeated this several times. Search spiders are easily unsettled: when a site's pages keep changing, especially titles and other important attributes, they feel insecure and develop a serious distrust of the site. As a result, it took a month or two for my site's indexed pages to recover. So, webmasters: settle on your site's positioning before opening it to search engines. Opening it after the optimization is done is not too late.

Other examples: the site is just a private home page for you and your partner, for your own enjoyment, and you do not want it crawled; or the site is for internal company use, all of its content is private, and no spider needs to crawl it; or any other specific situation in which no search engine should crawl the site.

The syntax that prohibits all search engines from indexing any page of the site is:

User-agent: *
Disallow: /
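A quick way to confirm that a block-everything rule behaves as intended is to feed it to Python's standard urllib.robotparser. This is only a sketch: the helper can_crawl and the URL under www.example.com are names I made up, and real spiders may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The block-everything rules, written with the full "User-agent" field name.
BLOCK_ALL = ["User-agent: *", "Disallow: /"]


def can_crawl(rules, agent, url):
    """Parse robots.txt lines and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("Baiduspider", "Googlebot"):
        # Hypothetical URL; every spider should be refused.
        print(agent, can_crawl(BLOCK_ALL, agent, "http://www.example.com/index.html"))
```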

The second situation is the need to prohibit all search engines from crawling certain directories of the site.

(1) Some directories are program directories that do not need to be crawled at all. To keep spiders from consuming server resources there, you can prohibit all search engines from crawling them.

(2) Some directories hold member information or other sensitive, private content that search engines should not crawl.

(3) Some directories hold content that was collected from elsewhere and left unchanged. It is there only to fill out the site, and you do not want search engines to index it, so crawling must be prohibited. For example, on one site I ran, part of the content was entirely original and was meant to be crawled. Another part was collected wholesale, only to enrich the site and improve the user experience; I did not want search engines to treat it as spam and demote the site, so I blocked search spiders from those directories. And so on.

An example of the syntax that prohibits all search engines from crawling specific directories or pages:

User-agent: *
Disallow: /plus/count.php
Disallow: /include
Disallow: /news/old
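One detail worth seeing in action is that a Disallow value is matched as a path prefix, so /include blocks more than the directory of that name. The sketch below uses Python's standard urllib.robotparser; the file names and the www.example.com host are hypothetical, chosen only to illustrate the matching.

```python
from urllib.robotparser import RobotFileParser

RULES = [
    "User-agent: *",
    "Disallow: /plus/count.php",
    "Disallow: /include",
    "Disallow: /news/old",
]

parser = RobotFileParser()
parser.parse(RULES)


def blocked(path):
    """True if the rules forbid a spider from fetching this path."""
    return not parser.can_fetch("Googlebot", "http://www.example.com" + path)


if __name__ == "__main__":
    # Hypothetical paths: the last one is blocked only because of prefix matching.
    for path in ("/include/common.inc.php", "/news/latest.html", "/includes.html"):
        print(path, blocked(path))
```

If you mean to block only the directory, writing the rule with a trailing slash (Disallow: /include/) avoids catching unrelated paths such as /includes.html.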

If you are interested, have a look at the robots.txt of my newly launched site, dianzhu2.com, which contains some concrete examples.

Function 3: Forbid one particular spider from crawling any content on your site.

There are several cases:

(1) Baidu has severely demoted your site, you feel despised and humiliated, or you belong to the anti-Baidu alliance, so you want to break with Baidu and forbid it from crawling anything on your site.

(2) Your site is already as mighty as Taobao and can afford a complete ban on Baidu indexing its pages. Look at Taobao's robots.txt: for commercial and other reasons, Taobao has shut Baidu out. Yet Baiduspider, apparently smitten because Jack Ma is as handsome as E.T., thick-skinnedly indexed 1060 Taobao pages anyway. You can verify this by entering site:taobao.com in the Baidu search box.

(3) Any other search engine that you want to keep from indexing all of your site's content.

The syntax for prohibiting a specified search engine from crawling any content on your site is:

User-agent: Baiduspider
Disallow: /
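To see that such a rule affects only the named spider, here is a small check with Python's standard urllib.robotparser. The URL is a placeholder, and the result reflects how this particular parser matches user-agent names (case-insensitively), which I assume is close to what the real spiders do.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: Baiduspider", "Disallow: /"])

URL = "http://www.example.com/index.html"  # hypothetical page

# Only the spider named in the group is refused; others have no rule at all.
baidu_allowed = parser.can_fetch("Baiduspider", URL)
google_allowed = parser.can_fetch("Googlebot", URL)

if __name__ == "__main__":
    print("Baiduspider:", baidu_allowed)
    print("Googlebot:", google_allowed)
```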

Function 4: Allow only the specified search spiders to crawl your site.

Because our traffic comes mainly from a few major search engines, you may not want other foreign or domestic spiders, or rogue spiders, crawling your site and consuming server resources. This is where the following syntax comes in.

The syntax that allows only the specified search spider to crawl your site is:

User-agent: Baiduspider
Disallow:

User-agent: *
Disallow: /
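When several spiders are to be admitted, the file is just one "empty Disallow" group per spider followed by the catch-all ban. The sketch below generates such a file and checks it with Python's standard urllib.robotparser; build_allowlist is a hypothetical helper of mine, and the spider list is only an example.

```python
from urllib.robotparser import RobotFileParser


def build_allowlist(spiders):
    """Build robots.txt text that admits only the named spiders."""
    # An empty Disallow value means "nothing is disallowed" for that spider.
    groups = [f"User-agent: {name}\nDisallow:\n" for name in spiders]
    # Every spider not named above falls through to the catch-all ban.
    groups.append("User-agent: *\nDisallow: /\n")
    return "\n".join(groups)


text = build_allowlist(["Baiduspider", "Googlebot"])
parser = RobotFileParser()
parser.parse(text.splitlines())

if __name__ == "__main__":
    print(text)
    for agent in ("Baiduspider", "Googlebot", "Sosospider"):
        print(agent, parser.can_fetch(agent, "http://www.example.com/"))
```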

In place of User-agent: Baiduspider / Disallow:, you can list each of the major spiders you want to allow, one group per spider. A special reminder: robots.txt must be written correctly, so that it does not do unnecessary harm to the site. The names of the common spiders are:

Baidu spider: Baiduspider
Google spider: Googlebot
Tencent Soso spider: Sosospider
Yahoo spider: Yahoo Slurp
MSN spider: Msnbot

Function 5: Forbid all search engines from crawling all files, or files of specific types, on your site.

To let all search engines crawl only web pages and no images, the syntax is:
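As an illustration of why the file must be written exactly, the sketch below compares a correct rule with one whose field name is abbreviated to "User". It uses Python's standard urllib.robotparser, which silently skips lines it does not recognise; I assume, but cannot guarantee, that real spiders treat unknown fields in a similar way. The helper is_blocked and the URL are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(lines, agent="Googlebot"):
    """True if the given robots.txt lines forbid `agent` from a sample page."""
    parser = RobotFileParser()
    parser.parse(lines)
    return not parser.can_fetch(agent, "http://www.example.com/index.html")


correct = ["User-agent: *", "Disallow: /"]
misspelled = ["User: *", "Disallow: /"]  # "User" is not a recognised field

if __name__ == "__main__":
    print("correct file blocks crawling:", is_blocked(correct))
    print("misspelled file blocks crawling:", is_blocked(misspelled))
```

With the misspelled field the Disallow line belongs to no group, so this parser ignores it and the site stays open to crawling.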

User-agent: *
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.gif$
Disallow: /*.png$
Disallow: /*.bmp$
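File-type rules rely on two special characters: * matches any run of characters and $ anchors the end of the URL. As far as I know, Python's urllib.robotparser does not interpret them, so the sketch below uses a small hand-written matcher instead. It assumes the rules are written in the wildcard form /*.jpg$ and is a simplified illustration, not a full implementation of any search engine's matching.

```python
import re

# Assumed wildcard form of the image rules.
IMAGE_RULES = ["/*.jpg$", "/*.jpeg$", "/*.gif$", "/*.png$", "/*.bmp$"]


def rule_to_regex(pattern):
    """Translate a robots.txt path pattern using * and $ into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with "match anything".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path):
    """True if any image rule matches the path from its first character."""
    return any(rule_to_regex(rule).match(path) for rule in IMAGE_RULES)


if __name__ == "__main__":
    # Hypothetical paths on the site.
    for path in ("/images/logo.png", "/news/index.html", "/photo.jpg?size=2"):
        print(path, is_blocked(path))
```

Note the last path: because of the $ anchor, a URL that carries a query string after .jpg is not matched by these rules.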

If you only want to ban a particular search engine, replace the wildcard * with that spider's name, as described above.

Function 6: Prevent search engines from showing snapshots (cached copies) of your pages in search results while still indexing the pages.

This is done as follows:

Baidu supports a meta tag in the page that prevents search engines from showing a snapshot of it.

To prevent all search engines from showing snapshots of your site, place this meta tag in the <head> section of the page:

<meta name="robots" content="noarchive">

To let other search engines show snapshots and prevent only Baidu from doing so, use this tag instead:

<meta name="Baiduspider" content="noarchive">

Note: this tag only stops Baidu from showing a snapshot of the page. Baidu will continue to index the page and show a summary of it in search results. If it is Google, it is

A final note: some of you may have enabled server logging to analyze spider crawling and user visits. Spiders request robots.txt first; if it is not found, the server records a 404 error in the log. To keep the log small and free of useless entries, I recommend adding a robots.txt to the root directory of the site, even if it is an empty file.

Other uses you will have to work out gradually in practice. My new site went online today, and all of its content will be original; fellow webmasters are welcome to get in touch and offer suggestions. QQ: 1030036466 Shopkeeper Home: http://dianzhu2.com

This article was first published on A5. Reprints are welcome, but please keep the link.
