When different URLs point to pages with largely the same content, the phenomenon is called "duplicate content". If a site contains a lot of duplicate content, search engines will assume the site is of little value, so we should try to avoid duplicate content of every kind.
On dynamic websites, duplicate content is often caused by URL parameters, and URL rewriting can actually make the problem worse (somewhat ironically). With the original parameterized URL, the search engine may be able to work out that the duplication is caused by URL parameters and handle it automatically; URL rewriting, however, hides the parameters, so the search engine can no longer recognize them. For example:
The original URLs:
http://www.freeflying.com/articles.aspx?id=231&catelog=blog
http://www.freeflying.com/articles.aspx?id=231&catelog=news
The URLs after URL rewriting:
http://www.freeflying.com/blog/231.html
http://www.freeflying.com/news/231.html
Both of these URLs point to the same content, the article with id=231, but the article is listed under both the blog and the news columns, and for various reasons we end up with the two URLs shown above.
There are two ways to deal with this: one is to "exclude" the duplicates using the robots protocol, the other is to permanently redirect one URL to the other with a 301 redirect.
Today we'll talk about the robots protocol. Simply put, "robot" refers to the search engine crawler; Google's is called a "spider". Spiders are quite polite: before crawling your web content, they first ask for your opinion, and you communicate with the robot through the robots protocol. Concretely, there are two ways to do this:
1. Add a robots.txt file to the website root directory, for example:
# static content: forbid all the pages under the "Admin" folder
User-agent: *
Disallow: /admin
Lines starting with "#" are comments. User-agent specifies the search engine; * means all search engines, and you can also name a specific crawler, for example User-agent: Googlebot.
Disallow specifies the directories or pages that must not be crawled. Note: 1. this file is case sensitive; 2. the path must start with "/", which denotes the website root directory.
As with the rest of this series, the focus here is on ASP.NET, so for more notes on the robots.txt format see http://www.googlechinawebmaster.com/2008/03/robotstxt.html
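For example, a slightly fuller robots.txt (the paths below are purely illustrative, not from the original article) might look like this:

# block the admin area for every crawler
User-agent: *
Disallow: /admin

# additionally keep Googlebot out of a hypothetical /temp folder and one specific page
User-agent: Googlebot
Disallow: /temp
Disallow: /news/231.html

Each User-agent section applies only to the crawler it names, and each Disallow rule is a simple left-to-right prefix match, a point we will come back to at the end.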
But how do we generate this file dynamically (and there is actually quite a lot of demand for that)? The first thing that comes to mind is probably an I/O operation, writing a txt file into the root directory... but in fact there is another way: use a generic handler (an .ashx file), with code as follows:
<%@ WebHandler Language="C#" Class="Handler" %>

using System;
using System.Web;

public class Handler : IHttpHandler {

    public void ProcessRequest(HttpContext context) {

        HttpResponse response = context.Response;
        response.Clear();

        // response.ContentType = "text/plain"; if you want to view the output in IE6,
        // this statement must be left out, for reasons unknown
        // In real use, the following two lines would be generated dynamically, e.g. from a database
        response.Write("User-agent: * \n");
        response.Write("Disallow: /news/231.html \n");
        // Append a static robots file that holds the blocking rules that never change
        response.WriteFile("~/static-robots.txt");
        response.Flush();
    }

    public bool IsReusable {
        get {
            return false;
        }
    }
}
The generic handler implements IHttpHandler. In the earlier URL-rewriting section we talked about HttpModule; in fact, the ASP.NET application life cycle has a concept called the "pipeline": an HTTP request is "filtered/processed" by a series of HttpModules and eventually reaches an HttpHandler, which handles it. The HttpModules and the HttpHandler together form the "pipeline", which is quite a vivid name.
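As a refresher, and purely as an illustrative sketch rather than code from the earlier article (the module name and the header it writes are made up), an HttpModule that takes part in this pipeline looks roughly like this:

using System;
using System.Web;

// Illustrative module: it runs for every request before any handler executes,
// which is exactly the "filter/process" stage of the pipeline described above
public class PipelineDemoModule : IHttpModule
{
    public void Init(HttpApplication application)
    {
        application.BeginRequest += new EventHandler(OnBeginRequest);
    }

    private void OnBeginRequest(object sender, EventArgs e)
    {
        HttpApplication app = (HttpApplication)sender;
        // Hypothetical marker header, just to show where in the pipeline we are
        app.Context.Response.AppendHeader("X-Pipeline-Demo", "module-ran");
    }

    public void Dispose() { }
}

Such a module still has to be registered in web.config (under httpModules on IIS 6, or system.webServer/modules on IIS 7) before it joins the pipeline.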
If you are unfamiliar with this, look at the source code of the Page class and you will find that Page also implements IHttpHandler, so *.aspx files are the most commonly used HttpHandlers. But Page is not just an HttpHandler; it also hosts the complex page life-cycle events, so from a resource-saving point of view I often use a custom, lighter-weight *.ashx file to do simple work. Just as we generated a txt file here, we can also generate CAPTCHA images (jpg files), XML files, and so on.
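For instance, here is a minimal sketch (the file name XmlHandler.ashx and the XML content are invented for illustration) of the same IHttpHandler pattern serving dynamically generated XML instead of text:

<%@ WebHandler Language="C#" Class="XmlHandler" %>

using System;
using System.Web;

public class XmlHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        context.Response.Clear();
        context.Response.ContentType = "text/xml";
        // In real use these entries would be built from the database
        context.Response.Write("<?xml version=\"1.0\" encoding=\"utf-8\"?>");
        context.Response.Write("<articles><article id=\"231\" /></articles>");
        context.Response.Flush();
    }

    public bool IsReusable
    {
        get { return false; }
    }
}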
Then the other thing we need to do is the URL rewrite itself:
void Application_BeginRequest(object sender, EventArgs e)
{
    // Code that runs at the start of every request
    HttpContext context = HttpContext.Current;
    string currentLocation = context.Request.Url.LocalPath;

    if (currentLocation.ToLower() == "/website1/robots.txt")
    {
        context.RewritePath("~/Handler.ashx");
    }
}
In this way, the spider will think that there is a robots.txt file in the root directory of the website.
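Note that the check above hard-codes the virtual directory name "/website1/". As a sketch of my own (not from the original code), the same idea can be written so that it works whatever the application root happens to be:

void Application_BeginRequest(object sender, EventArgs e)
{
    HttpContext context = HttpContext.Current;

    // Resolve "~/robots.txt" against the current application root instead of
    // hard-coding "/website1/", so the site can move without breaking the rewrite
    string robotsPath = VirtualPathUtility.ToAbsolute("~/robots.txt");

    if (string.Equals(context.Request.Url.LocalPath, robotsPath, StringComparison.OrdinalIgnoreCase))
    {
        context.RewritePath("~/Handler.ashx");
    }
}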
2. Add a meta tag to the pages you want to block:
<meta id="meta" name="robots" content="noindex,nofollow" />
noindex means the page must not be indexed;
nofollow means the links on the page must not be "followed" (this will be explained in detail in the SEO Hack article).
That is the effect on a static page; if you need to generate it dynamically, it is also fairly straightforward:
protected void Page_Load(object sender, EventArgs e)
{
    HtmlMeta meta = new HtmlMeta();
    meta.Name = "robots";
    meta.Content = "noindex,nofollow";
    this.Header.Controls.Add(meta);
}

HtmlMeta can also be used to specify description, keywords and so on; the technique is exactly the same.
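As a hedged illustration (the strings below are invented; in practice they would come from the article record in the database), the same pattern covers the description and keywords tags:

protected void Page_Load(object sender, EventArgs e)
{
    // Hypothetical per-article values; in real use, load them by article id
    HtmlMeta description = new HtmlMeta();
    description.Name = "description";
    description.Content = "A short, article-specific summary goes here";
    this.Header.Controls.Add(description);

    HtmlMeta keywords = new HtmlMeta();
    keywords.Name = "keywords";
    keywords.Content = "asp.net, seo, duplicate content";
    this.Header.Controls.Add(keywords);
}

Generating these values per article rather than hard-coding them also matters for the first caveat below.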
So, how do we choose between these two ways? Some of my suggestions:
1. Prefer robots.txt whenever possible. It reduces the load on the website (even if only slightly), because once the spider has read robots.txt it will not request the blocked pages at all, whereas with the meta approach the spider must first request the page and only then decide not to index it; by that point the HTTP request has already been made and server-side resources have already been spent. In addition, if too many pages are blocked with meta tags, the spider may form a bad impression of the site and reduce or even abandon indexing it;
2. robots.txt rules are matched from left to right as plain prefixes; there is no regular-expression matching! So sometimes we have to fall back on the meta approach, as with the URLs from the beginning of this article:
http://www.freeflying.com/blog/231.html
http://www.freeflying.com/news/231.html
Finally, some caveats:
1. Do not use the same keywords and description on all pages. This is a mistake that is easy to make: Articles.aspx is a single page, but with its URL parameters it becomes thousands of pages, and if the keywords and description are hard-coded on that page, all of those thousands of pages will share the same keywords and description!
2. Try to avoid URL-based session IDs. When cookies are disabled on the client, ASP.NET can be configured to fall back to a URL-based SessionID, producing URLs similar to this:
http://www.freeflying.com/(S (c3hvob55wirrndfd564))/articles.aspx
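A minimal web.config sketch of the relevant setting (the values here are just one reasonable choice, not taken from the original article):

<configuration>
  <system.web>
    <!-- cookieless="UseUri" would embed the SessionID in the URL as shown above;
         "UseCookies" (or "AutoDetect") avoids those session-stamped URLs -->
    <sessionState mode="InProc" cookieless="UseCookies" timeout="20" />
  </system.web>
</configuration>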