If different URLs lead to pages with largely identical content, the phenomenon is called "duplicate content". A website with a lot of duplicate content looks low-value to search engines, so we should try to avoid it.
On a dynamic website, duplicate content is often caused by URL parameters, and URL rewriting can actually make the problem worse (which is interesting). If the original parameterised URLs are kept, the search engine may be able to work out on its own that the duplication is caused by URL parameters and handle it automatically; URL rewriting hides the parameters, so the search engine can no longer recognise them. For example:
The original URLs:
http://www.freeflying.com/articles.aspx?id=231&catelog=blog
http://www.freeflying.com/articles.aspx?id=231&catelog=news
The URLs after URL rewriting:
http://www.freeflying.com/blog/231.html
http://www.freeflying.com/news/231.html
These URLs actually point to the same content: the article with id=231. But that article is referenced by both the blog section and the news section, and for various reasons the final URLs end up as shown above.
There are two solutions: one is to use the robots protocol to "exclude" one of them; the other is to permanently redirect one URL to the other with a 301.
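As a quick illustration of the second option (not from the original article), here is a minimal sketch of a 301 redirect in Global.asax, assuming we pick the /blog/ URL as the canonical one:

void Application_BeginRequest(object sender, EventArgs e)
{
    // Hypothetical rule: treat the /blog/ URL as canonical and permanently
    // redirect the duplicate /news/ URL to it with a 301.
    HttpContext context = HttpContext.Current;
    if (context.Request.Url.LocalPath.ToLower() == "/news/231.html")
    {
        context.Response.StatusCode = 301;
        context.Response.AddHeader("Location", "http://www.freeflying.com/blog/231.html");
        context.Response.End();
    }
}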
Today, let's talk about the robots protocol. Put simply, a robot is a search engine crawler; Google's, for example, is known as the "spider". Spiders are very polite: before crawling your page content, they first ask for your opinion. You and the robot communicate beforehand through the robots protocol. There are two ways to implement it:
1. Add a robots.txt file to the root directory of the website, for example:
# Static content, forbid all the pages under the "admin" folder
User-agent: *
Disallow: /admin
The "#" marks a comment line;
User-agent specifies which search engine the rules apply to: * means all search engines, or a specific crawler can be named, such as User-agent: Googlebot;
Disallow specifies the directories or pages that must not be visited. Note: 1. this file is case sensitive; 2. paths must start with "/", which stands for the site root.
This series focuses on ASP.NET techniques, so for more about the robots.txt file itself, see http://www.googlechinawebmaster.com/2008/03/robotstxt.html.
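Applied to the duplicate-content example at the top of this article, a purely static robots.txt could look like the following (just a sketch, assuming we decide the /news/ copy of article 231 is the one to hide):

# Block the duplicate copy of article 231 that is reachable under /news/
User-agent: *
Disallow: /news/231.html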
But how can we generate this file dynamically (and there are actually plenty of scenarios that call for it)? The first thing that comes to mind may be I/O: write a txt file into the root directory... But there is another way: use a generic handler (an .ashx file). The code:

<%@ WebHandler Language="C#" Class="Handler" %>
using System;
using System.Web;

public class Handler : IHttpHandler {

    public void ProcessRequest(HttpContext context) {
        HttpResponse response = context.Response;
        response.Clear();

        // response.ContentType = "text/plain";
        // With this line the page cannot be viewed in IE6, for reasons unknown.

        // In real use, the following two lines would be generated dynamically, e.g. from the database.
        response.Write("User-agent: *\n");
        response.Write("Disallow: /news/231.html\n");

        // Append the content of a static robots file, which holds the blocked entries that never change.
        response.WriteFile("~/static-robots.txt");
        response.Flush();
    }

    public bool IsReusable {
        get {
            return false;
        }
    }
}
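For completeness, the static-robots.txt file referenced by WriteFile above might hold the entries that never change, for example (hypothetical content):

# static-robots.txt: blocked entries that never change
Disallow: /admin/
Disallow: /aspnet_client/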
A generic handler implements IHttpHandler. In the earlier URL-rewriting post we talked about HttpModule; in fact, the ASP.NET application lifecycle has a concept called the "pipeline": an HTTP request passes through the "filtering/processing" of a chain of HttpModules and finally reaches an HttpHandler, the "processor". The HttpModules and the HttpHandler together form a "pipeline", which is a very vivid image. Here is a picture:
If this is still unfamiliar, look at the source code of Page: it also implements IHttpHandler, so *.aspx is in fact the most commonly used HttpHandler. But a Page is not just an HttpHandler; it also carries the full set of complicated page lifecycle events, so to save resources I often use a custom, more lightweight *.ashx file to do simple jobs. Besides generating a txt file like this one, the same approach works for CAPTCHA images (jpg files), XML files, and so on.
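As another illustration of this lightweight approach (not from the original article), a generic handler can emit XML, for example a tiny sitemap; a sketch, with the URL hard-coded instead of read from the database:

<%@ WebHandler Language="C#" Class="SitemapHandler" %>

using System;
using System.Web;

// Hypothetical handler that returns a very small XML sitemap.
public class SitemapHandler : IHttpHandler {

    public void ProcessRequest(HttpContext context) {
        HttpResponse response = context.Response;
        response.Clear();
        response.ContentType = "text/xml";
        response.Write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        response.Write("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        // In real use these entries would come from the database.
        response.Write("  <url><loc>http://www.freeflying.com/blog/231.html</loc></url>\n");
        response.Write("</urlset>");
        response.Flush();
    }

    public bool IsReusable {
        get { return false; }
    }
}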
Then all we need is a URL rewrite:

void Application_BeginRequest(object sender, EventArgs e)
{
    // Runs at the beginning of every request
    HttpContext context = HttpContext.Current;
    string currentLocation = context.Request.Url.LocalPath;
    if (currentLocation.ToLower() == "/website1/robots.txt")
    {
        context.RewritePath("~/Handler.ashx");
    }
}
In this way, as far as the spider is concerned, a robots.txt file really does exist in the root directory of the website.
2. Add the meta tag to the page to be blocked
<meta id="meta" name="robots" content="noindex, nofollow" />
noindex means that the page must not be indexed;
nofollow means that the links on the page must not be "followed" (this will be covered in detail in the SEO Hack post).
That is how it looks on a static page. If the tag needs to be generated dynamically, it is quite simple:

protected void Page_Load(object sender, EventArgs e)
{
    HtmlMeta meta = new HtmlMeta();
    meta.Name = "robots";
    meta.Content = "noindex, nofollow";
    this.Header.Controls.Add(meta);
}
The description and keywords meta tags can also be set this way, and the technique is exactly the same.
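For instance, a sketch of setting the description and keywords dynamically (the values here are placeholders; in real use they would come from the article record):

protected void Page_Load(object sender, EventArgs e)
{
    // Placeholder values; in real use these would be loaded per article.
    string summary = "A short, unique summary of this article";
    string tags = "seo, asp.net, duplicate content";

    HtmlMeta description = new HtmlMeta();
    description.Name = "description";
    description.Content = summary;
    this.Header.Controls.Add(description);

    HtmlMeta keywords = new HtmlMeta();
    keywords.Name = "keywords";
    keywords.Content = tags;
    this.Header.Controls.Add(keywords);
}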
So how do we choose between the two methods? Some of my suggestions:
1. Prefer robots.txt. Once a page is blocked there, the spider simply stops requesting it; with the meta approach, the spider must first request the page and only then decide not to index it, so the HTTP request has already been made and server resources have already been wasted. Besides, if too many pages are blocked via meta, the spider may form a poor impression of the site and reduce or even abandon indexing it;
2. robots.txt entries are matched against the URL from left to right, and there is no regular-expression matching! So sometimes we have no choice but to use the meta tag. Take, for example, the two URLs from the beginning of this article:
http://www.freeflying.com/blog/231.html
http://www.freeflying.com/news/231.html
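In a case like this, here is a sketch of the meta approach inside articles.aspx itself, assuming the "catelog" parameter from the original URLs is still available after rewriting and that the blog version is the one we want indexed:

protected void Page_Load(object sender, EventArgs e)
{
    // Hypothetical rule: only the "blog" version of the article should be indexed;
    // any other catelog gets noindex, nofollow.
    string catelog = Request.QueryString["catelog"];
    if (!string.IsNullOrEmpty(catelog) && catelog.ToLower() != "blog")
    {
        HtmlMeta meta = new HtmlMeta();
        meta.Name = "robots";
        meta.Content = "noindex, nofollow";
        this.Header.Controls.Add(meta);
    }
}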
Finally, let's talk about some precautions:
1. Do not use the same keywords and description on all pages. This is an easy mistake to make: articles.aspx is one page, but with URL parameters it becomes thousands of pages; if you hard-code the keywords and description on that page, those thousands of pages will all share the same keywords and description!
2. Avoid URL-based session ids. ASP.NET can be configured to fall back to a URL-based session id when the client has cookies disabled, which produces URLs like:
http://www.freeflying.com/(S(c3hvob55wirrndfd564))/articles.aspx
Each visit then carries its own session id in the URL, so the same page appears under countless different URLs, which is duplicate content all over again.
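If cookie support can be assumed, a minimal web.config sketch keeps the session id out of the URL altogether:

<configuration>
  <system.web>
    <!-- Force cookie-based sessions so the session id never appears in the URL -->
    <sessionState cookieless="UseCookies" />
  </system.web>
</configuration>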