Tips for avoiding spider crawling and indexing errors: avoiding conflicts


As you know, you can't always rely on search engine spiders to crawl or index your site effectively. Left entirely to their own devices, spiders pick up a lot of duplicate content, treat some important pages as junk, index link entry points that should never be shown to users, and cause other problems. There are tools that give us fuller control over spiders' activity within a site, such as the meta robots tag, robots.txt, and the canonical tag.

Today, I want to talk about the limitations of robot control techniques. To keep spiders off a page, webmasters sometimes apply several robot control techniques at once. Unfortunately, these techniques can contradict each other, and such restrictions can also hide certain dead links.

So what happens when a page is blocked by the robots.txt file while also carrying a noindex tag or a canonical tag?

Quick Review

Before we get into the subject, let's review the mainstream robot restriction techniques:

Meta robots tag

The meta robots tag establishes page-level instructions for search engine robots. The meta robots tag should be placed in the head section of the HTML file.
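
For example, a meta robots tag telling spiders not to index a page (while still following its links) looks like this — a minimal illustrative snippet:

<meta name="robots" content="noindex, follow" />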

Canonical tag

The canonical tag is a page-level meta tag placed in the HTML header of a page. It tells search engines which URL is the canonical version to display. Its goal is to keep search engines from indexing duplicate content while concentrating the weight of the duplicated pages onto the canonical page.

The code looks like this:

<link rel="canonical" href="http://example.com/quality-wrenches.htm" />

X-Robots-Tag

Since 2007, Google and other search engines have supported the X-Robots-Tag as a way to tell spiders how to handle crawling and indexing. The X-Robots-Tag is sent in the HTTP response header and instructs spiders how to crawl and index the file. This tag is especially useful for controlling indexing of non-HTML files, such as PDF files.
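
For example, to keep PDFs out of the index, the server can send the tag in its response headers. A minimal sketch, assuming an Apache server with mod_headers enabled (the file pattern is a placeholder):

<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex"
</FilesMatch>

Matching responses then carry the header line X-Robots-Tag: noindex, which spiders honor just like a meta robots tag.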

Robots.txt

Robots.txt controls where search engines may go inside a site, but it doesn't guarantee that a particular page won't be crawled or indexed. It is worth using only when it is genuinely necessary to shield part of a site from robots; for SEO purposes, I always recommend using the meta robots "noindex" tag instead.
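
A typical robots.txt rule looks like this (the directory is a placeholder):

User-agent: *
Disallow: /private/

This asks all spiders to stay out of /private/, but it does not stop the blocked URLs from appearing in the index if other pages link to them.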

Avoid conflicts

It is unwise to use more than one of the following methods at once to restrict robot access:

· Meta robots "noindex"

· Canonical tag (when pointing to a different URL)

· Robots.txt Disallow

· X-Robots-Tag "noindex"

However much you want to keep a page out of the search results, one method is always better than two. Let's take a look at what happens when multiple path-control techniques are applied to a single URL.

Meta robots "noindex" and the canonical tag

If your goal is to pass the weight of one URL to another and you have no better way to do it, use only the canonical tag. Don't make trouble for yourself by adding the meta robots "noindex" tag as well. If you use both robot methods, search engines may never see your canonical tag at all. The weight transfer will be lost, because the robots "noindex" tag prevents the canonical tag from being seen!
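
To make the conflict concrete, here is a hypothetical page head combining both tags; the URL is illustrative only:

<head>
<!-- "noindex" tells spiders to drop this page from the index... -->
<meta name="robots" content="noindex" />
<!-- ...so this weight-passing hint is likely never acted on -->
<link rel="canonical" href="http://example.com/quality-wrenches.htm" />
</head>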

Meta robots "noindex" & X-Robots-Tag "noindex"

These tags are redundant. Placing both on the same page can only hurt your SEO, as far as I can see. If you can set "noindex" in the meta robots tag in the HTML head, you shouldn't use the X-Robots-Tag header as well.
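
A sketch of the redundancy, with placeholder content; the HTTP header and the HTML tag below say exactly the same thing:

HTTP/1.1 200 OK
X-Robots-Tag: noindex

<!-- in the same page's head: redundant, the header above already says it -->
<meta name="robots" content="noindex" />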

Robots.txt Disallow & meta "noindex"

This is the most common conflict I've ever seen:

The reason I favor meta "noindex" is that it effectively keeps a page out of the index while still letting the page pass weight to the deeper pages it links to. That is a win-win approach. A robots.txt Disallow, by contrast, completely prevents search engines from seeing the information on the page (and its valuable internal links), and the URL can still end up indexed anyway. What's the benefit of that? I have written a separate article on this subject.

If the two are combined, the robots.txt Disallow guarantees that the meta robots "noindex" is never seen by spiders. You suffer the effects of the Disallow in robots.txt and miss all the benefits of the meta robots "noindex".
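
Here is the losing combination, with a placeholder URL; the Disallow stops spiders from ever fetching the page, so the "noindex" inside it is never read:

# robots.txt -- blocks crawling of the page below
User-agent: *
Disallow: /old-page.html

<!-- in /old-page.html: never fetched, so this directive is dead -->
<meta name="robots" content="noindex" />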

Article source: www.leadseo.cn, Shanghai website optimization experts. Please keep the source when reprinting. Thank you very much!
