Cloud Morning Watch (a digression: when I was in my twenties people said I looked forty, and now that I am past forty quite a few people say I look thirty-something, so a man can look more or less the same from twenty to fifty) posted an article: looking at the common SEO problems of first-tier domestic e-commerce websites through Jingdong Mall. I suggest reading that post first and then coming back to this one; otherwise it will not be easy to follow what is said below.
Simply put, that post points out a serious and very real SEO problem: many websites, especially e-commerce sites, have product filtering systems (letting users filter products by brand, price, size, performance, parameters and so on) that generate a large number of invalid URLs. "Invalid" here is meant only from an SEO point of view: these URLs produce no SEO benefit and in fact have a negative effect, so it is better that they not be indexed. The reasons include:
1. A large number of filter-condition pages have duplicate or very similar content (large amounts of duplicated content lower the overall quality of the site).
2. Many filter-condition pages have no matching products and therefore no content (for example, a selection such as "42-inch LED TVs under 100 yuan").
3. The vast majority of filter-condition pages have no ranking ability (far lower than category pages), yet they waste a certain amount of link weight.
4. These filter-condition pages are not a necessary channel for getting product pages indexed (product pages should have other internal links helping them get crawled and indexed).
5. Crawling a large number of filter-condition pages wastes a huge amount of spider crawling time and causes the number of useful pages that get indexed to drop (the number of possible filter combinations is enormous).
So how can we keep these URLs from being crawled, indexed and included as much as possible? A post from a few days ago, on how hiding content can also become an SEO problem, discussed a similar issue; these filter pages are exactly one of the kinds of content one wants to hide. Unfortunately, I cannot think of a perfect solution at the moment. Cloud Morning Watch proposed two methods, and in my view neither solves the problem completely.
The first method: leave the URLs you don't want indexed as dynamic URLs, even deliberately making them more dynamic, to keep them from being crawled and indexed. However, search engines can now crawl and index dynamic URLs, and technically this is less and less of an obstacle. Although a large number of parameters does discourage indexing to some extent, URLs with four or five parameters are usually still indexed. We cannot be sure how many parameters are needed to block indexing, so this is not a reliable method. Moreover, these URLs still receive internal links and have no ranking ability, so they still waste a certain amount of link weight.
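For illustration only (the domain, script name and parameter names are invented), a dynamic filter URL of this kind might look like:

    http://www.example.com/list.php?cat=tv&brand=sony&price=0-1000&size=42&sort=price&page=2

As noted above, four or five parameters usually do not stop a URL like this from being indexed, which is why counting on the parameters alone is not safe.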
The second method: block these URLs with robots.txt. The problem is similar: the URLs still receive weight from internal links, but because the robots file forbids crawling them, that weight cannot be passed on (the search engine never crawls the page, so it does not know what outgoing links are there). The pages become black holes that only absorb link weight.
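As a minimal sketch (the paths and the parameter name are hypothetical, and the wildcard syntax assumes an engine such as Google that supports it), a robots.txt rule blocking these filter URLs might look like:

    User-agent: *
    Disallow: /list.php?*brand=
    Disallow: /filter/

The blocked URLs still collect weight from the internal links pointing at them; they simply can no longer pass it on.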
Pointing to these URLs with nofollow links is not perfect either. Similar to the robots ban, nofollow in Google means the URL does not receive weight, but that weight is not reassigned to the other links either, so it is wasted just the same. Baidu is said to support nofollow, but how it handles the weight is unknown.
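For reference, such a link (the URL is hypothetical) is simply written as:

    <a href="/list.php?brand=sony&price=0-1000" rel="nofollow">Sony, under 1000 yuan</a>

Under Google's current handling, the weight this link would have carried is dropped rather than redistributed to the other links on the page.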
Putting these URL links in Flash or JavaScript does not help either. Search engines can already crawl links in Flash and JS, and can be expected to get better and better at it. A point many SEOs overlook is that links in JS are not only crawled, they can also pass weight, the same as normal links.
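As a simplified illustration (URL and anchor text made up), even a link written into the page by script can be discovered and followed:

    <script>
    // The anchor exists only after the script runs, but engines that execute
    // JavaScript will still treat it as an ordinary, weight-passing link.
    document.write('<a href="/list.php?brand=sony">Sony TVs</a>');
    </script>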
You can also turn the filter links into Ajax: when the user clicks, no new URL is visited; the page stays on the original URL, only adding a # fragment to it, which is not treated as a different URL. As with the JS issue, search engines are actively trying to crawl and index content inside Ajax, so this method is not safe either.
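A minimal sketch of such an Ajax filter link (the endpoint, element id and function name are invented for the example):

    <a href="#brand=sony" onclick="loadFilter('brand=sony'); return false;">Sony</a>
    <script>
    // Fetch the filtered product list and swap it into the page
    // without navigating to a new URL.
    function loadFilter(query) {
      var xhr = new XMLHttpRequest();
      xhr.open('GET', '/ajax/products?' + query);
      xhr.onload = function () {
        document.getElementById('product-list').innerHTML = xhr.responseText;
      };
      xhr.send();
    }
    </script>

The visible URL only gains a # fragment, so in principle it does not count as a separate URL.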
Another method is to add a noindex,follow tag to the head section of the page, meaning: do not index this page, but do follow the links on it. This solves the duplicate content problem, and it also solves the weight black-hole problem (weight can flow out to other pages through the links). What it cannot solve is the waste of spider crawling time: these pages still have to be crawled before the spider sees the noindex,follow tag in the HTML. For some websites the number of filter pages is huge, and if the spider crawls all of them it may not have enough time left to crawl the useful pages.
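The tag itself is just one line in the head section:

    <meta name="robots" content="noindex,follow">

Note that the spider has to fetch the page before it can see this tag, which is exactly why crawl time is still wasted.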
Another method to consider is cloaking, that is, using a program to detect the visitor: if it is a search engine spider, return a version of the page with the filter conditions removed; if it is a normal user, return the normal page with the filter conditions. This would be a fairly ideal solution; the only problem is that it may be treated as cheating. The highest principle search engines preach to SEOs for judging cheating is: would you still do this if search engines did not exist? Or is it done purely for search engines? Obviously, using cloaking to hide URLs you don't want crawled is done for search engines, not for users. Although the purpose of cloaking here is good and there is no malice, the risk is there, so only the daring should try it.
Yet another method is the canonical tag. The biggest question mark is whether Baidu supports it, which is unknown, and the canonical tag is only a suggestion to search engines, not a directive, meaning the engine may choose not to follow it and the tag would then be useless. Besides, the canonical tag is intended to designate a standardized (canonical) URL, and whether that properly applies to filter-condition pages is somewhat doubtful, since the content of these pages is often different.
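For reference, one possible use (URLs hypothetical) would be to put a tag like this in the head of each filter page, suggesting the plain category page as the standard version:

    <link rel="canonical" href="http://www.example.com/tv/">

Since the filter pages are usually not true duplicates of the category page, the engine may well ignore the hint, which is exactly the doubt raised above.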
One of the better methods at present is iframe plus a robots ban. Putting the filtering part of the code into an iframe amounts to calling in content from another file; for the search engine, that content does not belong to the current page, i.e. it is hidden. But not belonging to the current page does not mean it does not exist: search engines can still discover the content and links inside the iframe and may still crawl those URLs, so on top of that you add a robots.txt rule forbidding crawling. Some weight is still lost through the links in the iframe, but because those links draw weight only from the called file, not from the current page, the loss is relatively small. Apart from headaches with layout and browser compatibility, a potential problem with the iframe approach is the risk of being judged as cheating. Search engines currently do not generally regard iframes themselves as cheating, and many ads are placed in iframes, but there is a subtle difference between that and hiding a pile of links. Going back to the general principle for judging cheating, it is hard to argue this is not done specifically for search engines. And I remember Matt Cutts saying Google might later change the way it handles iframes: they want to see, on the same page, everything an ordinary user can see.
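A minimal sketch of the arrangement (file and script names are hypothetical): the filter block is pulled in from a separate file on the category page, and robots.txt keeps spiders away from that file and from the filter URLs it links to.

    <!-- On the category page -->
    <iframe src="/filters.html" width="100%" height="300" frameborder="0"></iframe>

    # In robots.txt
    User-agent: *
    Disallow: /filters.html
    Disallow: /list.php?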
In short, I do not have a perfect answer to this real and serious problem at the moment. Of course, not being able to solve it perfectly is not fatal: different websites have different SEO priorities, so analyze the specific situation, and using one or more of the methods above should take care of the main part of the problem.
And the biggest problem is not even any of the above, but the fact that sometimes you actually want these filter pages to be crawled and indexed; that is where the real trouble begins. We will discuss that another time.
Author: Zac @ SEO Every Day
All rights reserved. Reprints must include a link back to the author and the original source, together with this statement.
Original: http://www.seozac.com/seo-tips/duplicate-urls-content/