RFC 1738 gives the following rule for URL encoding:
"Only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'()," [not including the quotes - ed], and reserved characters used for their reserved purposes may be used unencoded within a URL."
The RFC then explains what reserved, special, and unsafe characters are. In short, the following three kinds of characters can appear in a URL without being encoded:
- Alphanumerics: [0-9a-zA-Z]
- Special characters: $ - _ . + ! * ' ( ) ,
- Reserved characters (when used for their reserved purposes): & / : ; = ? @
To make our thinking clearer, let's further summarize which characters must be encoded:
- Characters that have no representation in the ASCII table, such as Chinese characters.
- Unsafe characters, including: # " % < > [ ] { } | \ ^ ` ~
- Reserved characters that are not being used for their reserved purposes, that is, & / : ; = ? @ when they appear as ordinary data.
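The two lists above can be turned into a small check. This is only a sketch of the rules as summarized here, not a complete RFC 1738 validator; the function name `must_encode` and its `used_for_reserved_purpose` flag are made up for illustration.

```python
import string

# Character classes as summarized above (based on RFC 1738).
ALPHANUMERIC = set(string.ascii_letters + string.digits)
SPECIAL = set("$-_.+!*'(),")
RESERVED = set(";/?:@=&")


def must_encode(ch: str, used_for_reserved_purpose: bool = False) -> bool:
    """Return True if the single character ch has to be percent-encoded."""
    if ch in ALPHANUMERIC or ch in SPECIAL:
        return False
    if ch in RESERVED:
        # Reserved characters stay literal only when they act as delimiters.
        return not used_for_reserved_purpose
    # Everything else: non-ASCII characters, unsafe characters, spaces, etc.
    return True


for ch in ["a", "$", "<", "&", "田"]:
    print(repr(ch), must_encode(ch))
```

Whether a reserved character is "used for its reserved purpose" depends on where it sits in the URL, so the caller has to supply that information; the sketch does not try to parse the URL itself.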
For more information, see the figure:
[Figure: URL encoding in the ASCII table]
How are characters encoded?
A character is represented by one or more octets (8-bit bytes), and each octet can be written as two hexadecimal digits. For example, the octet for the character "<" is 3C in hexadecimal. In a URL, each such octet is encoded as "%" followed by its two hexadecimal digits. For example:
- "<" is encoded as %3C, and the space character "SP" is encoded as %20.
- The GB2312 encoding of 田 ("Tian") is CC EF in hexadecimal, so its URL encoding is %CC%EF.
- For a character whose GBK encoding is 87 E5 in hexadecimal, the URL encoding is %87%E5.
- The UTF-8 encoding of 田 ("Tian") is E7 94 B0 in hexadecimal, so its URL encoding is %E7%94%B0.
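The examples above can be reproduced in a few lines. This sketch assumes that "Tian" stands for the character 田 (its UTF-8 bytes E7 94 B0 match the value quoted above); `percent_encode` is a helper written for this illustration, and it encodes every byte unconditionally, which is stricter than the rules require.

```python
from urllib.parse import quote


def percent_encode(data: bytes) -> str:
    """Write every byte as '%' followed by two uppercase hex digits."""
    return "".join(f"%{byte:02X}" for byte in data)


print(percent_encode("<".encode("ascii")))    # %3C
print(percent_encode(" ".encode("ascii")))    # %20
print(percent_encode("田".encode("gb2312")))  # %CC%EF
print(percent_encode("田".encode("utf-8")))   # %E7%94%B0

# The standard library does the same, but leaves safe characters untouched.
print(quote("田", encoding="gb2312"))  # %CC%EF
print(quote("田", encoding="utf-8"))   # %E7%94%B0
```

The same character produces different percent-encoded forms depending on the character set chosen before encoding, which is the root of the browser differences discussed in this article.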
More on URLs that contain Chinese characters
RFC 1738 does not specify how Chinese characters should be encoded; the choice is left to the browser, so the encoding of Chinese characters in URLs is inconsistent. Testing shows that browsers treat Chinese characters in the query string and in the path differently.
1. The query string contains Chinese characters
Enter http://www.baidu.com/s?wd=Tian Miao (with the keyword typed as Chinese characters) in the address bar, press Enter, and use Fiddler to observe the requests sent by the browser (IE8 and Firefox in this test):
[Figure: The query string contains Chinese characters]
IE8 encodes the Chinese characters as GBK and sends the raw bytes to the server without percent-encoding them (which does not comply with the RFC). Firefox also uses GBK, but adds the "%" encoding. The Windows system used for this test has GBK as its system encoding. The conclusion: when a URL is entered directly in the address bar and its query string contains Chinese characters, both IE and Firefox encode them with the system encoding, and only Firefox then percent-encodes the result as the specification requires.
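The difference between the two browsers can be imitated as follows. This is a simplified illustration, not the captured Fiddler output: the request lines are hand-built, the sample keyword 田 is an assumption, and GBK is hard-coded to stand in for the system encoding.

```python
from urllib.parse import quote

keyword = "田"           # assumed sample keyword
system_encoding = "gbk"  # stands in for the Windows system encoding

# IE8-style: the GBK bytes go onto the wire as they are.
raw = keyword.encode(system_encoding)
ie8_line = b"GET /s?wd=" + raw + b" HTTP/1.1"

# Firefox-style: the same GBK bytes, but percent-encoded first.
encoded = quote(keyword, encoding=system_encoding)
firefox_line = ("GET /s?wd=" + encoded + " HTTP/1.1").encode("ascii")

print(ie8_line)      # contains non-ASCII bytes
print(firefox_line)  # pure ASCII
```

Only the second request line consists entirely of ASCII characters, which is what the RFC expects of a URL.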
Note 1: Do not use Google for this test. In Google's search URL (for example: http://www.google.com/?hl=en&source=hp&q?) the search keyword is not part of the query string, because it is placed after a "#". This confused me for a long time.
Note 2: This rule applies only to URLs entered directly in the address bar. If the page is opened by clicking a link, for example when page B is opened from a link on page A that contains Chinese characters, the encoding the browser uses for the Chinese characters depends on the encoding of page A.
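A consequence of Note 2 is that the server cannot always know which character set a link was encoded with. The sketch below assumes the sample character 田 and uses percent-encoding only to make the bytes visible; it does not claim that a particular browser percent-encodes in this situation.

```python
from urllib.parse import quote, unquote

keyword = "田"  # assumed sample character

# The same link text yields different bytes depending on page A's encoding.
from_gbk_page = quote(keyword, encoding="gbk")
from_utf8_page = quote(keyword, encoding="utf-8")
print(from_gbk_page)
print(from_utf8_page)

# Decoding with the matching character set recovers the keyword.
right_guess = unquote(from_gbk_page, encoding="gbk")
print(right_guess == keyword)

# Decoding GBK bytes as UTF-8 yields replacement characters instead.
wrong_guess = unquote(from_gbk_page, encoding="utf-8", errors="replace")
print(wrong_guess == keyword)
```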
2. The URL path contains Chinese characters
Enter http://www.hudong.com/wiki/Tian Miao (with the last path segment typed as Chinese characters) in the address bar, press Enter, and observe the request:
[Figure: The path contains Chinese characters]
Both IE8 and Firefox encode the Chinese characters in the path as UTF-8 and then percent-encode them according to the specification.
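The path behavior corresponds to what Python's `urllib.parse.quote` does by default. The path below is a shortened stand-in with the assumed sample character 田, not the exact URL from the test.

```python
from urllib.parse import quote, unquote

path = "/wiki/田"  # assumed sample path

# quote() defaults to UTF-8 and keeps "/" literal, since it is a path delimiter.
encoded_path = quote(path)
print(encoded_path)  # /wiki/%E7%94%B0

# unquote() also defaults to UTF-8, so the round trip is lossless.
print(unquote(encoded_path) == path)  # True
```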
Summary
This article has covered which characters must be encoded, which characters need not be, and the basic mechanics of URL encoding.
Source: http://hi.baidu.com/wely_ton/item/f138b1209787201108750884