Abstract: When different links point to pages with largely the same content, the phenomenon is called "duplicate content". If a site contains a lot of duplicate content, search engines will conclude that the site's value is low, so we should try to avoid every kind of duplication.
Duplicate content on dynamic sites is often caused by URL parameters, and URL rewriting can actually make the problem worse. With the original parameterized URLs, the search engine may be able to work out that the duplication is caused by URL parameters and handle it automatically; URL rewriting hides those parameters, so the search engine can no longer recognize them. For example:
Original URLs:
http://www.freeflying.com/articles.aspx?id=231&catelog=blog
http://www.freeflying.com/articles.aspx?id=231&catelog=news
URLs after rewriting:
http://www.freeflying.com/blog/231.html
http://www.freeflying.com/news/231.html
These URLs point to what is actually the same page content, the article with id=231; the article simply appears under both the blog and news columns, and for various reasons our final URLs end up as shown above.
There are two ways to deal with this: one is to "exclude" the duplicate with the robots protocol, the other is to permanently redirect one URL to the other with a 301.
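Although this post focuses on the robots protocol, here is a rough sketch of the 301 approach for comparison (the choice of Global.asax and of the /news/ URL as the duplicate is only an assumption for illustration):

void Application_BeginRequest(object sender, EventArgs e)
{
    HttpContext context = HttpContext.Current;

    // treat the /news/ copy as the duplicate and point it permanently at the /blog/ copy
    if (context.Request.Url.LocalPath.ToLower() == "/news/231.html")
    {
        context.Response.StatusCode = 301;
        context.Response.AddHeader("Location", "http://www.freeflying.com/blog/231.html");
        context.Response.End();
    }
}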
Today we will talk about the robots protocol. Simply put, a robot is a search engine crawler; Google's is called a "spider". Spiders are polite and will ask for your opinion before crawling your pages, and you communicate with them through the robots protocol. Concretely, there are two ways to implement it:
1. Add a robots.txt text file to the site root directory, for example:
# static content: forbid all the pages under the "Admin" folder
User-agent: *
Disallow: /Admin

Lines beginning with "#" are comments;
User-agent refers to the search engine; * means all search engines, and a specific engine can also be named, e.g. User-agent: googlebot;
Disallow specifies the directories or pages that must not be accessed. Note: 1. this text is case-sensitive; 2. the path must begin with "/", which stands for the site root directory.
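For instance, a robots.txt that gives one rule set to Googlebot and another to every other spider might look like this (the folder names are made up for illustration):

# rules for Google's spider only
User-agent: googlebot
Disallow: /Admin
Disallow: /temp

# all other spiders may crawl everything
User-agent: *
Disallow: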
As with the rest of this series, we focus on ASP.NET techniques; for more notes on the robots.txt format, see http://www.googlechinawebmaster.com/2008/03/robotstxt.html
But how do we generate this file dynamically (there is in fact plenty of demand for this)? We probably think of I/O operations first, writing a txt file into the root directory..., but there is another way: use a generic handler (.ashx file), with code like the following:
<%@ WebHandler Language="C#" Class="Handler" %>

using System;
using System.Web;

public class Handler : IHttpHandler {

    public void ProcessRequest(HttpContext context) {
        HttpResponse response = context.Response;
        response.Clear();

        //response.ContentType = "text/plain";  // if you want to view the output in IE6, this line cannot be set, for reasons unknown

        // in real use, the following two lines should be generated dynamically from the database
        response.Write("User-agent: * \n");
        response.Write("Disallow: /news/231.html \n");

        // reference the content of a static file holding the blocking rules that never change
        response.WriteFile("~/static-robots.txt");

        response.Flush();
    }

    public bool IsReusable {
        get {
            return false;
        }
    }
}
The generic handler implements IHttpHandler. In the earlier UrlRewrite section we talked about HttpModule; in fact, the ASP.NET application life cycle has a concept called the "pipeline": an HTTP request passes through the "filtering/processing" of a series of HttpModules and finally reaches the HttpHandler that "processes" it. The HttpModules and the HttpHandler together form the pipeline. [Figure: the ASP.NET request pipeline, HttpModules feeding into the HttpHandler]
If you are unfamiliar with this, look at the source code of Page: you will find that Page also implements IHttpHandler, so *.aspx files are the most commonly used HttpHandlers. But Page is not just an HttpHandler; it also carries the complicated page life-cycle events, so from a resource-saving point of view I often use a custom, lighter-weight *.ashx file to do simple jobs. Similar to generating a txt file, we can also generate verification codes (jpg files), XML files and so on.
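As a rough sketch of the same idea applied to XML (the handler name "XmlHandler" and the element names are hypothetical):

<%@ WebHandler Language="C#" Class="XmlHandler" %>

using System;
using System.Web;

public class XmlHandler : IHttpHandler {

    public void ProcessRequest(HttpContext context) {
        HttpResponse response = context.Response;
        response.Clear();
        response.ContentType = "text/xml";

        // in a real application this XML would be built from the database
        response.Write("<?xml version=\"1.0\" encoding=\"utf-8\"?>");
        response.Write("<articles><article id=\"231\" /></articles>");

        response.Flush();
    }

    public bool IsReusable {
        get { return false; }
    }
}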
Back to the robots.txt handler: the only thing left to do is the URL rewrite:
void Application_BeginRequest(object sender, EventArgs e)
{
    // Code that runs at the beginning of each request
    HttpContext context = HttpContext.Current;
    string currentLocation = context.Request.Url.LocalPath;

    if (currentLocation.ToLower() == "/website1/robots.txt")
    {
        context.RewritePath("~/Handler.ashx");
    }
}
In this way, the spider will believe that there really is a robots.txt file in the root directory of the site.
2. Add a META tag to the pages that need to be blocked
<meta id="meta" name="robots" content="noindex,nofollow" />
noindex means the page must not be indexed;
nofollow means the links on the page must not be "followed" (this will be explained in detail in "SEO Hack").
This is how it is done for a static page; if you need to generate it dynamically, it is also fairly straightforward:
protected void Page_Load(object sender, EventArgs e)
{
    HtmlMeta meta = new HtmlMeta();
    meta.Name = "robots";
    meta.Content = "noindex,nofollow";
    this.Header.Controls.Add(meta);
}
Meta tags can also specify description, keywords and so on, and the technique for setting them is the same.
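For example, along the same lines (the description and keyword values below are placeholders; in practice they would come from the article record):

protected void Page_Load(object sender, EventArgs e)
{
    HtmlMeta description = new HtmlMeta();
    description.Name = "description";
    description.Content = "A short summary of this particular article";
    this.Header.Controls.Add(description);

    HtmlMeta keywords = new HtmlMeta();
    keywords.Name = "keywords";
    keywords.Content = "asp.net, seo, duplicate content";
    this.Header.Controls.Add(keywords);
}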
So how do we choose between the two approaches? A few of my suggestions:
1. Use robots.txt as much as possible. It reduces the load on the site (even if only slightly): once the spider has read the robots.txt file, it no longer requests the blocked pages, whereas with the Meta approach the spider must first request the page and only then decide not to index it, by which time the HTTP request has already been made and server-side resources have already been wasted. In addition, blocking too much with Meta can give spiders a bad impression of the site and reduce, or even end, its inclusion in search results.
2. robots.txt matches URLs from left to right as plain prefixes; there is no regular-expression matching! So sometimes we have to use the Meta approach instead. Take the URLs from the beginning of this article:
http://www.freeflying.com/blog/231.html
http://www.freeflying.com/news/231.html
There is no single robots.txt rule that matches "231.html" under every column, so each duplicate URL would have to be listed one by one (or a whole directory blocked); in such cases the Meta approach is the more practical one.
Finally, some caveats:
1. Do not use the same keywords and description on all pages. This is a mistake that is easy to make: although Articles.aspx is a single page, with URL parameters it becomes thousands of pages, and if you hard-code the keywords and description on that page, those thousands of pages will all end up with the same keywords and description!
2. Avoid URL-based SessionIDs as much as possible. When cookies are disabled on the client, ASP.NET can be configured to carry the SessionID in the URL, with an effect like this:
http://www.freeflying.com/(S(c3hvob55wirrndfd564))/articles.aspx
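If cookieless sessions are not actually needed, it is safer to keep the SessionID out of the URL explicitly in web.config (a minimal sketch; AutoDetect is another option when some clients genuinely block cookies):

<configuration>
  <system.web>
    <!-- keep the session id in a cookie so it never appears in the URL -->
    <sessionState cookieless="UseCookies" />
  </system.web>
</configuration>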
Free flying
Original link: http://www.cnblogs.com/freeflying/archive/2010/02/21/1670758.html