I. About the robots.txt file
1. When a search engine spider visits a site, it first checks whether a robots.txt file exists in the root directory of the website. This file tells the search engine which content it may or may not crawl. Note: even if all content is allowed to be crawled, an empty robots.txt file should still be placed in the root directory.
2. robots.txt is only meaningful when some content must be kept from being crawled; an empty file means search engines are allowed to crawl everything.
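As a minimal sketch of the allow-all case just described, the file can be left completely empty, or it can state the rule explicitly:
User-agent: *
Disallow: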
3. Syntax explanation:
The simplest robots file, which prohibits all search engines from crawling any content, reads:
User-agent: *
Disallow: /
User-agent specifies which spider the rule applies to; the wildcard * matches all search engines. To apply a rule only to Baidu's spider, write User-agent: Baiduspider; for Google's spider, User-agent: Googlebot.
Disallow tells the spider not to crawl certain files. For example, Disallow: /post/index.html tells the spider not to crawl the index.html file in the /post/ directory. Writing nothing after Disallow means all pages are allowed to be crawled.
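Putting these directives together, here is a sketch of a robots.txt that blocks only Baidu's spider from a hypothetical /private/ directory while letting every other spider crawl everything (the /private/ path is invented for illustration):
User-agent: Baiduspider
Disallow: /private/

User-agent: *
Disallow:
A spider follows the group whose User-agent line matches it most specifically, so Baiduspider obeys the first group and all other spiders fall through to the wildcard group.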
II. About META robots tags
1. Used to instruct search engines not to index the contents of the page on which the tag appears.
2. Syntax explanation: <meta name="robots" content="noindex,nofollow"> prohibits all search engines from indexing this page and from following the links on it.
noindex: tells the spider not to index this page.
nofollow: tells the spider not to follow the links on this page.
nosnippet: tells the search engine not to display a snippet (descriptive text) for this page in the search results.
noarchive: tells the search engine not to show a cached snapshot of this page.
noodp: tells the search engine not to use the title and description from the Open Directory Project.
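As a sketch of how these values combine inside a page's head (the particular combination here is only an illustration, not a recommendation from the text above):
<head>
<!-- allow indexing, but suppress the snippet and the cached snapshot -->
<meta name="robots" content="nosnippet,noarchive">
</head>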
How to keep search engines from indexing certain pages
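Combining the two mechanisms above, a sketch for keeping a hypothetical page /drafts/secret.html out of search results (the path is invented for illustration): either add a rule to robots.txt,
User-agent: *
Disallow: /drafts/secret.html
or place a meta tag inside that page's <head>:
<meta name="robots" content="noindex">
Note the difference in scope: robots.txt stops the spider from crawling the page, while the meta tag stops the search engine from indexing it.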