When different URLs point to pages with largely the same content, that is called "duplicate content". If a site contains a lot of duplicate content, the search engine will conclude that the site's value is not high, so we should try to avoid every kind of duplicate content.
On dynamic Web sites, duplicate content is often caused by URL parameters, and URL rewriting actually makes the problem worse (which is a bit ironic). With the original parameterized URLs, the search engine may be able to work out that the duplication is caused by URL parameters and handle it automatically; URL rewriting hides those parameters, so the search engine can no longer recognize them. For example:
The original URLs:
http://www.freeflying.com/articles.aspx?id=231&catelog=blog
http://www.freeflying.com/articles.aspx?id=231&catelog=news

The URLs after URL rewriting:
http://www.freeflying.com/blog/231.html
http://www.freeflying.com/news/231.html
These URLs actually point to the same page content, the article with id=231, but that article is listed under both the blog and the news columns, and for various reasons our final URLs end up as shown above.
There are two ways to deal with this: one is to "exclude" one of the URLs through the robots (robot exclusion) protocol, the other is to permanently redirect one URL to the other with a 301 redirect (a sketch of the latter is shown below).
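For completeness, here is a minimal sketch of the second approach. It is my own illustration, not code from this article: a generic handler that answers the duplicate news URL with a permanent 301 redirect to the canonical blog URL.

using System.Web;

// hedged sketch: send a permanent (301) redirect from the duplicate URL
// to the canonical one, so search engines consolidate the page's value there
public class PermanentRedirectHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        context.Response.StatusCode = 301;   // "Moved Permanently"
        context.Response.RedirectLocation = "http://www.freeflying.com/blog/231.html";
        context.Response.End();
    }

    public bool IsReusable { get { return true; } }
}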
Today we'll talk about the robots protocol. Simply put, a robot is a search engine crawler; in Google's case it is called a "spider". Spiders are polite and will ask for your permission before crawling the content of your Web pages, and you communicate with a robot through the robots protocol. In terms of implementation, there are two approaches:
1. Add a robots.txt file to the site root directory, such as:
#static content, forbid all the pages under the "Admin" folder
User-agent: *
Disallow: /admin
Lines starting with # are comments;
User-agent refers to the search engine; * means all search engines, and a specific search engine can also be named, such as User-agent: Googlebot;
Disallow specifies a directory or page that must not be accessed. Note: 1. this file is case-sensitive; 2. the path must start with "/", which represents the site root directory (a slightly fuller example follows these notes);
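As a hedged illustration (not from the original article; the directory names are made up), a robots.txt with a general rule group for all crawlers and a separate group for Googlebot could look like this:

#rules for all crawlers
User-agent: *
Disallow: /admin
Disallow: /temp

#a named group: Googlebot follows these rules instead of the * group
User-agent: Googlebot
Disallow: /print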
In keeping with the purpose of this series, we focus on ASP.NET technology, so for more notes on the robots.txt file, check out http://www.googlechinawebmaster.com/2008/03/robotstxt.html
But how do we generate this file dynamically (there is actually quite a lot of demand for this)? The first thing that comes to mind is probably an I/O operation, writing a txt file into the root directory..., but there is another way: use a generic handler (an .ashx file), with code as follows:
<%@ WebHandler Language="C#" Class="Handler" %>

using System;
using System.Web;

public class Handler : IHttpHandler {

    public void ProcessRequest(HttpContext context) {
        HttpResponse response = context.Response;
        response.Clear();
        //without this statement the page cannot be viewed with IE6, for reasons unknown
        response.ContentType = "text/plain";
        //in actual use these two lines should be generated dynamically, e.g. from the database
        response.Write("User-agent: * \n");
        response.Write("Disallow: /news/231.html \n");
        //a static file containing the part of the content that does not change
        response.WriteFile("~/static-robots.txt");
        response.Flush();
    }

    public bool IsReusable { get { return false; } }
}
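For the handler to actually answer requests for robots.txt, the request has to be routed to the .ashx file. As a minimal sketch (my own, not from the article, and assuming the handler file is named Handler.ashx in the site root), an HttpModule like the ones discussed in the previous URL rewriting section could rewrite the path:

using System;
using System.Web;

// hedged sketch: rewrite requests for /robots.txt to the generic handler above,
// so the dynamically generated rules are served at the standard location
public class RobotsRewriteModule : IHttpModule
{
    public void Init(HttpApplication application)
    {
        application.BeginRequest += new EventHandler(OnBeginRequest);
    }

    private void OnBeginRequest(object sender, EventArgs e)
    {
        HttpApplication app = (HttpApplication)sender;
        if (string.Equals(app.Request.AppRelativeCurrentExecutionFilePath,
                          "~/robots.txt", StringComparison.OrdinalIgnoreCase))
        {
            //assumes the generic handler is Handler.ashx in the site root
            app.Context.RewritePath("~/Handler.ashx");
        }
    }

    public void Dispose() { }
}

The module would still need to be registered in web.config, and on IIS 6 the request for a .txt file must first be mapped to ASP.NET before it can reach the pipeline at all.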
The generic handler implements IHttpHandler, and in the earlier UrlRewrite section we talked about HttpModule. In fact, the ASP.NET application life cycle has a concept called the "pipeline": an HTTP request passes through the "filtering/processing" of a series of HttpModules and finally reaches the "processor" part, an HttpHandler; together the HttpModules and the HttpHandler form a "pipeline", which is quite a vivid image. Here is a picture: