Character encoding in URL

Source: Internet
Author: User
Tags rfc
From: http://hi.baidu.com/jrckkyy/blog/item/d86c12ecea120c30279791be.html

 

Besides ordinary letters, digits, and Chinese characters, a URL can contain special characters. When one of them is meant literally rather than in its special role, it must be escaped:
"+" In a URL, + stands for a space: %2B
"space" A space in a URL can be written as a plus sign or encoded as %20
"/" Separates directories and subdirectories: %2F
"?" Separates the actual URL from its parameters: %3F
"%" Introduces an escaped special character: %25
"#" Marks a bookmark: %23
"&" Separates the parameters given in the URL: %26
"=" Assigns a value to a parameter in the URL: %3D
"\" Indicates a directory path: %5C
"." Full stop: %2E
":" Colon: %3A

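The escapes listed above can be checked with Python's standard `urllib.parse` module. This is only a sketch: the sample strings are made up for illustration, and "." is left out because `quote()` never escapes it (it is always safe), although `%2E` still decodes to it.

```python
from urllib.parse import quote, unquote, quote_plus, unquote_plus

# Characters from the list above. safe="" stops quote() from leaving "/"
# unescaped, which it would do by default.
specials = '+ /?%#&=\\:'
escaped = quote(specials, safe="")
print(escaped)            # %2B%20%2F%3F%25%23%26%3D%5C%3A
print(unquote(escaped))   # back to the original string

# The "+ means space" convention is handled by the *_plus variants.
print(quote_plus("a b+c"))        # a+b%2Bc
print(unquote_plus("a+b%2Bc"))    # a b+c
print(unquote("%2E"))             # .
```

Plain `unquote()` leaves a "+" untouched, so the choice between `unquote()` and `unquote_plus()` decides whether a plus sign is read as a space.
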
For URL encoding, RFC 1738 lays down the following rule:

"Only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'()," [not including the quotes - ed], and reserved characters used for their reserved purposes may be used unencoded within a URL."

The RFC then explains what reserved characters, special characters, and unsafe characters are. In other words, the following three kinds of characters can appear directly in a URL without being encoded:

  • Alphanumerics: [0-9a-zA-Z]
  • Special characters: $-_.+!*'(),
  • Reserved characters, when used for their reserved purpose: & / : ; = ? @

To make this clearer, here is the complementary summary of which characters must be encoded:

  • Characters with no displayable counterpart in the ASCII table, such as Chinese characters.
  • Unsafe characters, including: # " % < > [ ] { } | \ ^ ` ~
  • Reserved characters that are not being used for their reserved purpose, that is, & / : ; = ? @
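The two lists above amount to a small decision rule, which can be sketched in Python as follows. The function name `needs_encoding` and its `reserved_use` flag are hypothetical names introduced for this illustration; the flag stands for the caller's knowledge of whether a reserved character is serving its reserved purpose.

```python
import string

ALPHANUMERIC = set(string.ascii_letters + string.digits)
SPECIAL = set("$-_.+!*'(),")
RESERVED = set("&/:;=?@")

def needs_encoding(ch: str, reserved_use: bool = False) -> bool:
    """Return True if the single character ch must be percent-encoded.

    reserved_use tells whether a reserved character is being used for its
    reserved purpose, for example "/" as a path separator.
    """
    if ch in ALPHANUMERIC or ch in SPECIAL:
        return False
    if ch in RESERVED:
        return not reserved_use
    # Everything else: unsafe characters, the space, control characters,
    # and anything outside ASCII such as Chinese characters.
    return True

print(needs_encoding("a"))                     # False
print(needs_encoding("<"))                     # True
print(needs_encoding("/", reserved_use=True))  # False
print(needs_encoding("/"))                     # True
```
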

For more information, see the figure:

Figure: URL encoding in the ASCII table

How to encode it?

A character is represented by one or more octets (8-bit bytes), and each octet can be written as two hexadecimal digits. For example, the octet for the character "<" is 3C in hexadecimal. In a URL, each octet of the character is encoded as "%" followed by its two hexadecimal digits. For example:

  • "<" is encoded as %3C, and the space "SP" is encoded as %20.
  • The GB2312 encoding of the character 田 ("Tian") is CC EF in hexadecimal, so its URL encoding is %CC%EF.
  • For the character whose GBK encoding is 87 E5 in hexadecimal, the URL encoding is %87%E5.
  • The UTF-8 encoding of 田 ("Tian") is E7 94 B0 in hexadecimal, so its URL encoding is %E7%94%B0.
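The mechanism in these examples can be sketched in a few lines of Python. The helper `percent_encode` is a hypothetical name for this illustration, and the sketch assumes that the character the text calls "Tian" is 田 (U+7530), whose bytes match the values quoted above. To keep the mechanism visible, the helper escapes every octet, including ones that could legally be left alone.

```python
from urllib.parse import quote

def percent_encode(text: str, charset: str) -> str:
    """Write every octet of text, encoded in charset, as %XX."""
    return "".join(f"%{byte:02X}" for byte in text.encode(charset))

print(percent_encode("<", "ascii"))     # %3C
print(percent_encode(" ", "ascii"))     # %20
print(percent_encode("田", "gb2312"))   # %CC%EF
print(percent_encode("田", "utf-8"))    # %E7%94%B0

# The standard library gives the same result and skips safe characters.
print(quote("田", encoding="gb2312"))   # %CC%EF
print(quote("田"))                      # %E7%94%B0 (UTF-8 is the default)
```

The same character yields different URL encodings depending on the charset, which is why the receiver has to know which charset the sender used.
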

When the URL contains Chinese characters

RFC 1738 does not specify how Chinese characters should be encoded; the choice is left to the browser, so the encoding of Chinese characters in URLs is inconsistent. Testing shows that browsers handle Chinese characters differently depending on whether they appear in the URL's query string or in its path.

1. The query string contains Chinese characters

Enter http://www.baidu.com/s?wd=Tian Miao in the address bar (where "Tian Miao" stands for two Chinese characters), press Enter, and use Fiddler to observe the requests sent by the browser (here, IE8 and Firefox):

Figure: The query string contains Chinese characters

IE8 encodes the Chinese characters as GBK and sends the raw bytes to the server without percent-encoding them, which does not comply with the RFC. Firefox additionally adds the % escapes. The Windows system used for the test has GBK as its system encoding. The conclusion: when a URL is entered directly in the address bar and its query string contains Chinese characters, both IE and Firefox send them to the server in the system encoding, and only Firefox percent-encodes them according to the rules.

Note 1: Do not use Google for this test. In Google's search URL (for example: http://www.google.com/?hl=en&source=hp&q?), the search keyword is not part of the query string, because it follows a #. This confused me for a long time.
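The difference described in Note 1 is easy to see with Python's `urllib.parse.urlsplit`. The URLs below are hypothetical examples on example.com, not Google's actual URLs; the point is only that whatever follows a # lands in the fragment rather than in the query string.

```python
from urllib.parse import urlsplit

# Hypothetical URLs for illustration only.
with_hash = urlsplit("http://example.com/search#hl=en&q=test")
with_question = urlsplit("http://example.com/search?hl=en&q=test")

print(repr(with_hash.query))      # ''
print(with_hash.fragment)         # hl=en&q=test
print(with_question.query)        # hl=en&q=test
print(repr(with_question.fragment))  # ''
```
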

Note 2: This rule applies only when the URL is entered directly. If the page is opened by clicking a link, for example when page B is opened from a link on page A that contains Chinese characters, the encoding the browser uses for those characters depends on the encoding of page A.

2. The URL path contains Chinese characters

Enter http://www.hudong.com/wiki/Tian Miao in the address bar (again with "Tian Miao" standing for the Chinese characters), press Enter, and observe the request:

Figure: The path contains Chinese characters

Both IE8 and Firefox encode the Chinese characters as UTF-8 and then percent-encode them according to the specification.
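For the receiving side, the practical consequence is that the path and the query string may need to be decoded with different charsets. The following Python sketch assumes the sample character 田 (U+7530), sent as UTF-8 in the path and as GBK in the query string; the variable names are made up for this illustration.

```python
from urllib.parse import unquote

path_part = "%E7%94%B0"   # 田 percent-encoded from UTF-8, as in the path test
query_part = "%CC%EF"     # 田 percent-encoded from GBK, as Firefox sent the query

print(unquote(path_part, encoding="utf-8"))   # 田
print(unquote(query_part, encoding="gbk"))    # 田

# Decoding with the wrong charset fails or yields replacement characters.
try:
    unquote(query_part, encoding="utf-8", errors="strict")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)                             # False
```

With the default `errors="replace"`, the wrong charset does not raise; it silently produces U+FFFD replacement characters, which is how garbled text typically ends up on a page.
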

Summary

That settles the basic questions of URL encoding: which characters must be encoded and which need not be.
