Use .ashx files to avoid duplicate content as much as possible


If different URLs point to pages with largely the same content, the phenomenon is called "duplicate content". If a site has a lot of duplicate content, search engines will conclude that the site is not of much value, so we should try to avoid duplicate content of every kind.

On dynamic web sites, duplicate content is often caused by URL parameters, and URL rewriting can actually make the problem worse (which is somewhat ironic). With the original parameterized URLs, the search engine may be able to work out that the duplication is caused by URL parameters and handle it automatically; URL rewriting hides the parameters, so the search engine can no longer recognize them. For example:

The original URLs:
http://www.freeflying.com/articles.aspx?id=231&catelog=blog
http://www.freeflying.com/articles.aspx?id=231&catelog=news

The URLs after URL rewriting:
http://www.freeflying.com/blog/231.html
http://www.freeflying.com/news/231.html

These URLs actually point to the same content, the article with id=231. The article is listed under both the blog column and the news column, and for various reasons our final URLs are the ones shown above.

There are two ways to deal with this: one is to "exclude" the duplicate URL with the robots protocol, the other is to permanently redirect one URL to the other with a 301 redirect (a quick sketch of the latter follows).
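The 301 approach is not covered further in this article, but as a minimal sketch it could be done in Global.asax; this hypothetical example hard-codes this article's two URLs, where real code would look them up:

void Application_BeginRequest(object sender, EventArgs e)
{
    HttpContext context = HttpContext.Current;

    // hypothetical: send the duplicate news copy permanently to the blog copy
    if (context.Request.Url.LocalPath.ToLower() == "/news/231.html")
    {
        context.Response.StatusCode = 301;
        context.Response.AddHeader("Location", "http://www.freeflying.com/blog/231.html");
        context.Response.End();
    }
}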

Today we'll talk about the robots protocol. Simply put, "robot" refers to the search engine; in Google's case we call it the "spider". Spiders are polite: before crawling your web pages, they first ask for your permission, and you communicate with them through the robots protocol. Concretely, there are two ways to implement it:

1. Add a robots.txt file to the site root directory, for example:

# static content: forbid all the pages under the "Admin" folder
User-agent: *
Disallow: /Admin/

A line starting with # is a comment;

User-agent refers to the search engine; * means all search engines. You can also target a specific search engine, for example User-agent: Googlebot;

Disallow specifies the directories or pages that must not be accessed. Note: 1. this file is case-sensitive; 2. the path must start with "/", which stands for the site root directory;
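Putting the notes above together, a slightly fuller example might look like this (a sketch reusing this article's sample paths, not a file from the original post):

# block the duplicate news copy for all crawlers
User-agent: *
Disallow: /news/231.html

# additionally keep Googlebot out of the Admin folder
User-agent: Googlebot
Disallow: /Admin/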

As with the rest of this series, we focus on the ASP.NET techniques; for more notes on robots.txt itself, see http://www.googlechinawebmaster.com/2008/03/robotstxt.html

But how do we generate this file dynamically (and there is actually quite a lot of demand for that)? The first thing that comes to mind is probably an I/O operation: write a txt file into the root directory... But there is another way: use a generic handler (an .ashx file). The code is as follows:

<%@ WebHandler Language="C#" Class="Handler" %>

using System;
using System.Web;

public class Handler : IHttpHandler {

    public void ProcessRequest(HttpContext context) {

        HttpResponse response = context.Response;

        response.Clear();

        //response.ContentType = "filetype";  // if you want to view the output in IE6, do not set this, for reasons unknown

        // in real use, the following two lines would be generated dynamically from the database
        response.Write("User-agent: * \n");
        response.Write("Disallow: /news/231.html \n");

        // reference a static file for the blocked entries that never change
        response.WriteFile("~/static-robots.txt");

        response.Flush();
    }

    public bool IsReusable {
        get {
            return false;
        }
    }
}

The generic handler implements IHttpHandler. In the earlier UrlRewrite section we talked about HttpModule; in fact, the ASP.NET application life cycle has a concept called the "pipeline": an HTTP request passes through a series of HttpModules (the "filters") and finally reaches one HttpHandler (the "processor"). The HttpModules and the HttpHandler together form the "pipeline".

[Figure: the ASP.NET request pipeline]
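As a quick reminder of the HttpModule side of the pipeline, here is a minimal sketch (not code from this article; a real module would also be registered in web.config):

using System;
using System.Web;

public class DemoModule : IHttpModule
{
    public void Init(HttpApplication application)
    {
        // hook into the pipeline at the start of every request
        application.BeginRequest += new EventHandler(OnBeginRequest);
    }

    private void OnBeginRequest(object sender, EventArgs e)
    {
        // every request passes through here before it reaches an HttpHandler
        HttpApplication app = (HttpApplication)sender;
        app.Context.Response.AppendHeader("X-Demo-Module", "hello");
    }

    public void Dispose() { }
}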

If this is unfamiliar, look at the source code of Page: you will find that Page also implements IHttpHandler, so *.aspx files are the most commonly used HttpHandlers. But Page is not just an HttpHandler; it also carries the complex page life-cycle events, so from a resource-saving point of view I often use a custom, lighter-weight *.ashx file to do simple jobs. Just as we generated a txt file here, we can also generate verification codes (jpg files), XML files, and so on.
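For example, here is a sketch of the same technique emitting an XML file, a hypothetical sitemap whose entries would really come from the database:

<%@ WebHandler Language="C#" Class="SitemapHandler" %>

using System;
using System.Web;

public class SitemapHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        HttpResponse response = context.Response;
        response.Clear();
        response.ContentType = "text/xml";

        response.Write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        response.Write("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        // in real use these entries would be generated from the database
        response.Write("  <url><loc>http://www.freeflying.com/blog/231.html</loc></url>\n");
        response.Write("</urlset>");

        response.Flush();
    }

    public bool IsReusable
    {
        get { return false; }
    }
}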

Then one more thing needs to be done: rewrite the URL:

void Application_BeginRequest(object sender, EventArgs e)
{
    // runs at the beginning of every request
    HttpContext context = HttpContext.Current;
    string currentLocation = context.Request.Url.LocalPath;

    // "/website1" is the virtual directory of the sample site
    if (currentLocation.ToLower() == "/website1/robots.txt")
    {
        context.RewritePath("~/handler.ashx");
    }
}

In this way, the spider will assume that there is a robots.txt file in the root directory of the site.

2. Add a META tag to the pages that need to be blocked:

<meta id= "meta" name= "Robots" content= "Noindex,nofollow"/>

noindex means the page must not be indexed;

nofollow means the links on the page must not be "followed" (this will be explained in detail in the SEO Hack article).

That is the effect for a static page; if you need to generate it dynamically, it is also fairly straightforward:

protected void Page_Load(object sender, EventArgs e)
{
    HtmlMeta meta = new HtmlMeta();
    meta.Name = "robots";
    meta.Content = "noindex,nofollow";
    this.Header.Controls.Add(meta);
}

A meta tag can also specify description, keywords and so on; the technical implementation is the same.

So, how do we choose between the two approaches? Some of my suggestions:

1. Use robots.txt as much as possible. It reduces the load on the site (even if the difference is very small): once the spider has read the robots.txt file, it no longer requests the blocked pages at all, whereas with the META approach the spider must first request the page and then decide not to index it, so the HTTP request has already been made and server-side resources have already been wasted. In addition, blocking too many pages via META gives the spider a bad impression of the site and may cause it to reduce or abandon indexing of the site;

2. The patterns in robots.txt are matched from left to right, and there is no regular-expression matching! So sometimes we have to use the META approach, for example for the URLs from the beginning of this article (a sketch follows below):

http://www.freeflying.com/blog/231.html

http://www.freeflying.com/news/231.html
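A sketch of that META approach, assuming the rewritten URL still carries the catelog query parameter from the original URL so the page knows which column it was reached through:

protected void Page_Load(object sender, EventArgs e)
{
    // hypothetical: block only the duplicate news copy, leave the blog copy indexable
    if (string.Equals(Request.QueryString["catelog"], "news", StringComparison.OrdinalIgnoreCase))
    {
        HtmlMeta meta = new HtmlMeta();
        meta.Name = "robots";
        meta.Content = "noindex,nofollow";
        this.Header.Controls.Add(meta);
    }
}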

Finally, some caveats:

1. Do not use the same keywords and description on all pages. This is a mistake that is easy to make: although articles.aspx is a single page, with URL parameters it becomes thousands of pages, and if you hard-code the keywords and description on that page, those thousands of pages will all share the same keywords and description!
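A sketch of the fix, assuming a hypothetical GetArticle lookup that returns each article's own keywords and summary (in real code this would query the database):

protected void Page_Load(object sender, EventArgs e)
{
    // hypothetical lookup; Article, GetArticle, Keywords and Summary are illustrative names
    Article article = GetArticle(Convert.ToInt32(Request.QueryString["id"]));

    HtmlMeta keywords = new HtmlMeta();
    keywords.Name = "keywords";
    keywords.Content = article.Keywords;    // unique per article
    this.Header.Controls.Add(keywords);

    HtmlMeta description = new HtmlMeta();
    description.Name = "description";
    description.Content = article.Summary;  // unique per article
    this.Header.Controls.Add(description);
}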

2. Avoid URL-based SessionIDs as much as possible. When cookies are disabled on the client, ASP.NET can be configured to carry the SessionID in the URL, with an effect like this:

http://www.freeflying.com/(S (c3hvob55wirrndfd564))/articles.aspx
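If cookieless sessions are not strictly required, keeping the SessionID out of the URL is a web.config setting (ASP.NET 2.0 and later):

<configuration>
  <system.web>
    <!-- force cookie-based sessions so the SessionID never appears in the URL -->
    <sessionState cookieless="UseCookies" />
  </system.web>
</configuration>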

Free flying

Original link: http://www.cnblogs.com/freeflying/archive/2010/02/21/1670758.html
