A practical summary of everything that needs attention in the robots file

In the course of building a site, webmasters often use the robots file to concentrate weight or balance how weight is distributed. Although this file is just a simple text document, its contents can affect how the whole site is indexed. It looks very simple, yet in practice many webmasters do not know how to write one for their own site, and some are so afraid of making a mistake that they simply leave it out. With these situations in mind, I have summed up how to write this file correctly in practice.

To write this file well, you must pay attention to several aspects. The most common ones are its format, the use of wildcards, the different search engine spiders, and a few other frequent mistakes. Only by getting a clear picture of these issues can you write a correct file that matches your own website. Let's start today's content; corrections are welcome.

What the robots file does: In simple terms, it is a protocol that tells search engines which content may be crawled and included and which may not, so as to achieve basic control over how the site's weight is distributed. When a search engine visits a website, it first checks whether a plain text file named robots.txt exists in the root directory. If it does, the engine follows the agreement in it and does not crawl the forbidden pages; if nothing is forbidden, or the robots file is empty, the search engine will by default access all files. Incidentally, if the site does not have this file, it is best to put one in the root directory; even an empty one helps search engines.
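For instance, a minimal robots.txt that forbids nothing and simply tells every spider it may crawl everything would sit at the site root (http://www.example.com/robots.txt, where example.com is only a placeholder) and contain no more than:

User-agent: *
Disallow:

An empty Disallow line forbids nothing, which has the same effect as imposing no restrictions at all.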

Be careful with the format: I have run into cases where a formatting problem caused a site not to be indexed at all, especially when a blanket ban was involved. In the robots file the most common symbol is /, which stands for the root directory of the site. If a / follows Disallow, it means search engines are forbidden to crawl any content. The common format is as follows:

User-agent: *
Disallow: /

This means search engines are forbidden to crawl any content. If you want everything to be indexed instead, simply turn the Disallow rule into an Allow rule.
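As a sketch, the fully open version of the same file would read:

User-agent: *
Allow: /

An empty Disallow: line achieves the same effect and is the more traditional way to write it.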

Wildcards: A site often contains a lot of duplicate content, for example the sorting, printing, and paging functions found on many consumer websites; these do not need to be crawled, so wildcards are used in the robots file. The common wildcard is *, which matches any sequence of characters (in the User-agent line it stands for all search engines), while $ matches the end of a URL. For example, if you want search engines to crawl only files ending in .html, you can pair an Allow rule with a blanket Disallow and write:

User-agent: *
Allow: /*.html$
Disallow: /

If instead you want to forbid search engines from crawling all HTML pages, you can write:

User-agent: *
Disallow: /*.html

Spider classification: Different search engines use different spiders, and in the robots file you sometimes need to address a specific engine's spider by name; if a rule should apply to all search engines alike, the wildcard * mentioned above is enough. Here are the spiders of the different engines: Baidu's spider is Baiduspider and Google's is Googlebot, and these two are the ones most often specified. Other engines such as Sogou have similar spiders, but the need to single them out does not come up very often. In general, the mainstream search engines all support the robots file.
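To make the names concrete, here is a sketch that gives each spider its own rule (the folder names are made up purely for illustration):

User-agent: Baiduspider
Disallow: /print/

User-agent: Googlebot
Disallow: /tmp/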

Application examples and precautions: Each line in the robots file must hold exactly one rule. If there are two or more things to forbid, they must be written separately, one per line; they cannot be placed on a single line, or they will not be recognized. If you want one particular search engine not to crawl while all the other search engines do crawl, you have to write two separate User-agent/Disallow groups (a sketch of that case follows the example below). In addition, if part of a folder should be crawled and part should be forbidden, Disallow and Allow are mixed. For example, if the rest of the seo folder should not be crawled but the aa folder inside it should be, you can write:

User-agent: *
Disallow: /seo/
Allow: /seo/aa/
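And as mentioned above, the case where one particular engine is kept out while every other engine may crawl freely takes two separate groups, for example (using Baiduspider only as an illustration):

User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow: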

In addition, the location of the site map can also be written into the robots file, which makes it easier for search engines to crawl and index the site, for example a Sitemap: line pointing at the XML map's location. This submits the pages that need to be included to the search engine through the XML file and can speed up indexing. It should be pointed out, however, that a page blocked in robots will not necessarily stay out of the search results: if the page has inbound links, users may still find it in the results. If you want it not to appear at all, you have to combine robots with the meta robots tag. I will share that with you later.
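For reference, the map location is written on its own line with the full URL, for example (the address below is only a placeholder):

Sitemap: http://www.example.com/sitemap.xml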

Well, that is it for this article. If there is anything you still do not understand, you are welcome to get in touch with me. This article is from: Fun Broadcast Network, URL: http://www.7v7.cc/. Please retain the copyright notice when reproducing, thank you!
