URL normalization is the process of standardizing URLs, that is, converting a URL into an equivalent URL that complies with the specification (for example, converting http://www.cnblogs.com/shuchao to http://www.cnblogs.com/shuchao/). A program can then determine that the two URLs are equivalent.
Search engines use URL normalization to reduce duplicate indexing of pages and repeated crawling by their crawlers. Browsers also use URL normalization to tell whether a user has already visited a URL.
- 1. URL composition
- 2. Nonstandard URLs
- 3. URL standardization process
- 4. SEO URL standardization
URL composition:
protocol://hostname[:port]/path/[;parameters][?query]#fragment
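As a minimal sketch of how these components can be pulled apart, the snippet below uses `urlparse` from Python's standard library. The example URL, including its port and its `;v=1` parameter, is made up for illustration.

```python
from urllib.parse import urlparse

# Hypothetical URL that exercises every component of the pattern above.
url = "http://www.example.com:8080/test/index.html;v=1?id=123&sort=ascending#seo"
parts = urlparse(url)

print(parts.scheme)    # protocol
print(parts.hostname)  # host name
print(parts.port)      # port, as an int (None when absent)
print(parts.path)      # path
print(parts.params)    # parameters of the last path segment
print(parts.query)     # query string
print(parts.fragment)  # fragment
```

`urlparse` is used here rather than `urlsplit` because only `urlparse` separates the `;parameters` part from the path.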
Nonstandard URLs:
1. Extra characters in the URL
1.1 The URL of a subdomain contains "www": "http://www.shuchao.cnblogs.com/"
1.2 Contains the default port: "http://www.cnblogs.com:80/shuchao/"
1.3 Loose URL: "http://www.chapters.indigo.ca/books/Amazon-Sucks-donkey-bils/9780470170779-Item.html"
1.4 Redundant default file names such as index.html and default.aspx: "http://www.cnblogs.com/shuchao/index.html"
1.5 File path
(1) Redundant "/": "http://www.cnblogs.com/shuchao//"
(2) Redundant dot segments ("." and ".."): "http://www.cnblogs.com/../page.html"
1.6 Redundant query strings
(1) "?" (empty query string): "http://www.cnblogs.com/shuchao?"
(2) A dangling "&"
(3) Useless query variables: "http://www.example.com/display?id=123&fake=fake"
2. Missing characters in the URL
2.1 Missing trailing "/": "http://www.cnblogs.com/shuchao"
2.2 Query string missing a name or a value: "http://www.example.com/display?id=" or "http://www.example.com/display?=123"
3. Other nonstandard URLs
3.1 "http://shuchao.cnblogs.com/" and "http://www.cnblogs.com/shuchao/" actually serve the same content
3.2 Uses an IP address instead of a domain name
3.3 Contains extended characters, or differs only in letter case ("http://www.google.cn/Intl/zh-CN/about.html" and "http://www.google.cn/intl/zh-CN/about.html")
3.4 Mixes "+" and "%20"
3.5 Query variables in arbitrary order: "http://www.example.com/test.aspx?bar=1&a=test"
3.6 Contains temporary state variables: "http://www.example.com/test?back=/prevpage.aspx"
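Several of the traits in this list can be detected mechanically. The following is a rough sketch, not a complete checker: the function name `find_issues`, the issue labels, and the two lookup tables are assumptions made for this example, and traits that need site knowledge (3.1, 3.6, useless variables) are not covered.

```python
import ipaddress
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
DEFAULT_FILES = {"index.html", "default.aspx"}  # the file names mentioned in 1.4


def find_issues(url: str) -> list[str]:
    """Return labels for the nonstandard traits found in url (hypothetical helper)."""
    issues = []
    parts = urlsplit(url)
    host = parts.hostname or ""

    if parts.port is not None and parts.port == DEFAULT_PORTS.get(parts.scheme):
        issues.append("default port")                      # 1.2
    if parts.path.rsplit("/", 1)[-1].lower() in DEFAULT_FILES:
        issues.append("default file name")                 # 1.4
    if "//" in parts.path:
        issues.append("redundant slash")                   # 1.5 (1)
    if any(seg in (".", "..") for seg in parts.path.split("/")):
        issues.append("dot segment")                       # 1.5 (2)
    if url.split("#", 1)[0].endswith("?"):
        issues.append("empty query string")                # 1.6 (1)
    if parts.query:
        pairs = parts.query.split("&")
        if "" in pairs:
            issues.append("dangling &")                    # 1.6 (2)
        for pair in pairs:
            name, _, value = pair.partition("=")
            if pair and (not name or not value):
                issues.append("query variable missing name or value")  # 2.2
                break
    try:
        ipaddress.ip_address(host)
        issues.append("IP address as host")                # 3.2
    except ValueError:
        pass                                               # host is a domain name
    return issues


print(find_issues("http://www.cnblogs.com:80/shuchao/"))
print(find_issues("http://www.example.com/display?id="))
```

Each check looks only at the URL string itself, so the function can flag a URL without fetching the page.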
URL standardization process:
1. Lowercase the URL protocol name and host name
HTTP://www.Example.com/test -> http://www.example.com/test
2. Convert the hexadecimal digits of escape sequences to uppercase. The two spellings are equivalent, but a plain string comparison is case-sensitive and would treat them as different URLs.
%3a -> %3A
3. Delete the fragment (#)
http://www.example.com/test/index.html#seo -> http://www.example.com/test/index.html
4. Delete an empty "?"
http://www.example.com/test? -> http://www.example.com/test
5. Delete the default file name
http://www.example.com/test/index.html -> http://www.example.com/test/
6. Delete unnecessary dot segments
http://www.example.com/../a/b/../c/./d.html -> http://www.example.com/a/c/d.html
7. Delete unnecessary "www"
http://www.test.example.com/ -> http://test.example.com/
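8. Sort query variables
http://www.example.com/test.aspx?bar=1&a=test -> http://www.example.com/test.aspx?a=test&bar=1
Useless query variables can be removed in the same pass:
http://www.example.com/test?id=123&fakefoo=fakebar -> http://www.example.com/test?id=123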
9. Delete variables that carry their default value
http://www.example.com/test?id=&sort=ascending -> http://www.example.com/test
10. Delete unnecessary query string characters, such as "?" and "&"
http://www.example.com/test? -> http://www.example.com/test
11. DUST rules (a heuristic method proposed by Schonfeld et al.)
http://www.example.com/test?id=123 -> http://www.example.com/test_123
SEO URL standardization:
Nonstandard URLs can create many duplicate URLs on a website. Crawlers then crawl the same content repeatedly, which hurts the crawling and indexing of the site's useful content.
Multiple nonstandard URLs also dilute PR (PageRank): PR that should all flow to one page is split across its nonstandard URLs.
There is also a user experience problem: complicated or nonstandard URLs can easily give users a bad impression of the website.
Google Webmaster Tools has added a URL normalization tool that removes useless parameters from URLs.