Data needs to be encoded when its raw form is not suitable for transmission. The reasons vary: the data may be too large, or it may contain private information. For URLs, encoding is needed because some characters would make the URL ambiguous.
For example, URL query parameters are passed as key=value pairs separated by the & symbol, as in /s?q=abc&ie=utf-8. If a value contains = or &, the server that receives the URL will inevitably parse it incorrectly. The ambiguous & and = symbols must therefore be escaped, that is, encoded.
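The ambiguity can be made concrete with a minimal sketch. Here parseQuery is a hypothetical naive parser written only for illustration, and encodeURIComponent is the standard JavaScript function discussed later in this article.

```javascript
// Hypothetical naive parser: splits on & and then on the first =.
function parseQuery(query) {
  const result = {};
  for (const pair of query.split("&")) {
    const index = pair.indexOf("=");
    const key = index === -1 ? pair : pair.slice(0, index);
    const value = index === -1 ? "" : pair.slice(index + 1);
    result[decodeURIComponent(key)] = decodeURIComponent(value);
  }
  return result;
}

// The string "a&b=c" is meant to be one single value for q.
const rawValue = "a&b=c";

// Unescaped, the receiver sees an extra parameter b.
const broken = parseQuery("q=" + rawValue + "&ie=utf-8");

// Escaped, & becomes %26 and = becomes %3D, so the value survives intact.
const fixed = parseQuery("q=" + encodeURIComponent(rawValue) + "&ie=utf-8");

console.log(broken); // { q: 'a', b: 'c', ie: 'utf-8' }
console.log(fixed);  // { q: 'a&b=c', ie: 'utf-8' }
```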
As another example, URLs are encoded in ASCII rather than Unicode. A URL therefore cannot contain non-ASCII characters such as Chinese; otherwise, if the client browser and the server support different character sets, the Chinese characters may cause problems.
The principle of URL encoding is to use safe characters (printable characters with no special purpose or meaning) to represent unsafe characters.
Prerequisites: URI stands for Uniform Resource Identifier, and a URL is just one kind of URI. A typical URL has the format shown below. The URL encoding discussed in the rest of this article is really URI encoding.
  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment
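As a rough sketch, the five components can be pulled apart with the regular expression given in Appendix B of RFC 3986. The helper name splitUri is an assumption made for this example.

```javascript
// Regular expression from RFC 3986, Appendix B.
const URI_PATTERN = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Hypothetical helper that gives names to the capture groups of interest.
function splitUri(uri) {
  const m = URI_PATTERN.exec(uri);
  return {
    scheme: m[2],
    authority: m[4],
    path: m[5],
    query: m[7],
    fragment: m[9],
  };
}

const parts = splitUri("foo://example.com:8042/over/there?name=ferret#nose");
console.log(parts.scheme);    // foo
console.log(parts.authority); // example.com:8042
console.log(parts.path);      // /over/there
console.log(parts.query);     // name=ferret
console.log(parts.fragment);  // nose
```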
Characters to be encoded
RFC 3986 stipulates that a URL may contain only letters (a-zA-Z), digits (0-9), the four special characters -_.~, and the reserved characters. It gives detailed recommendations on URL encoding and decoding, points out which characters must be encoded so that the meaning of the URL does not change, and explains why.
Characters with no printable representation in US-ASCII: only printable characters are allowed in a URL. The bytes 0x00-0x1F and 0x7F in US-ASCII are control characters and cannot appear directly in a URL. Likewise, the bytes 0x80-0xFF (as used by ISO-8859-1) lie outside the range defined by US-ASCII and cannot be placed in a URL.
Reserved characters: a URL can be divided into several components, such as the scheme, host, and path. Some characters (: / ? # [ ] @) separate the components from one another: the colon separates the scheme from the host, / separates the host from the path, and ? separates the path from the query parameters. Other characters (! $ & ' ( ) * + , ; =) act as delimiters inside a component: = joins the key and value of a query parameter, and & separates multiple key-value pairs. When ordinary data inside a component contains any of these characters, it must be encoded.
RFC 3986 specifies the following characters as reserved: ! * ' ( ) ; : @ & = + $ , / ? # [ ]
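The character classes described above can be written down as a small lookup. This is only a sketch: classifyChar is a hypothetical helper, and it assumes it is given exactly one ASCII character.

```javascript
// Character classes as defined in RFC 3986, sections 2.2 and 2.3.
const UNRESERVED = /^[A-Za-z0-9\-._~]$/;
const GEN_DELIMS = ":/?#[]@";     // separate the components of a URI
const SUB_DELIMS = "!$&'()*+,;="; // delimiters inside a component

// Hypothetical helper: classifies a single ASCII character.
function classifyChar(ch) {
  if (UNRESERVED.test(ch)) return "unreserved";
  if (GEN_DELIMS.includes(ch) || SUB_DELIMS.includes(ch)) return "reserved";
  return "other"; // must be percent-encoded when used as data
}

console.log(classifyChar("a")); // unreserved
console.log(classifyChar("~")); // unreserved
console.log(classifyChar("&")); // reserved
console.log(classifyChar(" ")); // other
```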
Unsafe characters: some characters may confuse the parser when placed directly in a URL. They are considered unsafe for a number of reasons.
- Space: while a URL is being transmitted, typeset by a user, or handled by a text-processing program, insignificant spaces may be introduced or significant spaces may be removed.
- Quotation marks and <>: quotation marks and angle brackets are commonly used to delimit URLs in ordinary text.
- #: used to indicate a bookmark or anchor.
- %: the percent sign is the special character used to encode unsafe characters, so it must itself be encoded.
- { } | \ ^ [ ] ` ~: some gateways or transport agents modify these characters.
Note that for characters that are valid in a URL, the encoded and unencoded forms are equivalent. For the characters described above, however, leaving them unencoded may change the meaning of the URL. Therefore only ordinary English letters and digits, the special characters $-_.+!*'(), and the reserved characters may appear unencoded in a URL. All other characters must be encoded before they can appear in a URL.
For historical reasons, however, some nonstandard encoding implementations remain. For example, although RFC 3986 states that ~ does not need to be encoded, many older gateways and transport agents still encode it.
How to encode invalid characters in a URL
URL encoding is also commonly called percent-encoding, because the method is very simple: a percent sign followed by two hexadecimal digits (0123456789ABCDEF) represents one byte. The default character set for URL encoding is US-ASCII. For example, the byte for a in US-ASCII is 0x61, so its URL-encoded form is %61; entering http://g.cn/search?q=%61%62%63 in the address bar is equivalent to searching for abc on Google. Likewise, the byte for the @ symbol in ASCII is 0x40, so its URL-encoded form is %40.
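The rule for a single ASCII character can be sketched in a few lines. The helper name percentEncodeAscii is an assumption for this example; decodeURIComponent is the standard JavaScript decoder.

```javascript
// Percent-encode one ASCII character: % followed by two hex digits.
function percentEncodeAscii(ch) {
  const byte = ch.charCodeAt(0);
  return "%" + byte.toString(16).toUpperCase().padStart(2, "0");
}

console.log(percentEncodeAscii("a")); // %61
console.log(percentEncodeAscii("@")); // %40

// Decoding reverses the process, so %61%62%63 is just another spelling of abc.
console.log(decodeURIComponent("%61%62%63")); // abc
```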
For non-ASCII characters, the super set of the ASCII character set must be used for encoding to obtain the corresponding bytes, and then each byte must be percent encoded.. For Unicode characters, we recommend that you use UTF-8 to encode them to obtain the corresponding bytes, and then perform percent encoding for each byte. For example, "Chinese" using the UTF-8 character set to get the byte 0xe4 0xb8 0xad 0xe6 0x96 0x87, after URL encoding, "% E4 % B8 % ad % E6 % 96% 87" is obtained ".
If a byte corresponds to a non-reserved character in the ASCII character set, this Byte does not need to be expressed by a percent sign. For example, "url encoding", the bytes produced by UTF-8 encoding are 0x55 0x72 0x6c 0xe7 0xbc 0x96 0xe7 0xa0 0x81, because the first three bytes correspond to the non-reserved characters "url" in ASCII, these three bytes can be represented by non-reserved characters "url. The final URL encoding can be simplified to "url % E7 % BC % 96% E7 % A0 % 81". Of course, you can also use "% 55% 72% 6C % E7 % BC % 96% E7 % A0 % 81.
Due to historical reasons, some URL encoding implementations do not fully follow this principle, which will be mentioned below.
Differences between escape, encodeURI, and encodeURIComponent in JavaScript
JavaScript provides three pairs of functions for encoding a URL into a valid form: escape/unescape, encodeURI/decodeURI, and encodeURIComponent/decodeURIComponent. Because decoding is simply the reverse of encoding, only the encoding functions are described here.
The three encoding functions, escape, encodeURI, and encodeURIComponent, all convert unsafe or invalid URL characters into valid ones. They differ in the following ways.
Different safe characters:
The safe characters of the three functions (that is, the characters each function leaves unencoded) are listed below.
- escape (69): * / @ + - . _ 0-9 a-z A-Z
- encodeURI (82): ! # $ & ' ( ) * + , / : ; = ? @ - . _ ~ 0-9 a-z A-Z
- encodeURIComponent (71): ! ' ( ) * - . _ ~ 0-9 a-z A-Z
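These lists can be checked empirically. The sketch below runs each function over the 128 ASCII characters and collects the ones that come back unchanged; safeChars and safeSets are names made up for this example.

```javascript
// Collect the ASCII characters that an encoding function returns unchanged.
function safeChars(encode) {
  let safe = "";
  for (let code = 0; code < 128; code++) {
    const ch = String.fromCharCode(code);
    if (encode(ch) === ch) safe += ch;
  }
  return safe;
}

const safeSets = {
  escape: safeChars(escape),
  encodeURI: safeChars(encodeURI),
  encodeURIComponent: safeChars(encodeURIComponent),
};

// Prints the function name, the number of safe characters (69, 82, 71),
// and the characters themselves.
for (const [name, chars] of Object.entries(safeSets)) {
  console.log(name, chars.length, chars);
}
```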
Different compatibility: escape has existed since JavaScript 1.0, while the other two functions were introduced in JavaScript 1.5. Since JavaScript 1.5 is now ubiquitous, using encodeURI and encodeURIComponent poses no compatibility problem in practice.
Different handling of Unicode characters: all three functions encode ASCII characters the same way, as a percent sign followed by two hexadecimal digits. For Unicode characters, however, escape produces %uXXXX, where XXXX is four hexadecimal digits representing the Unicode character. This form has been rejected by the W3C, although the escape syntax is still retained in the ECMA-262 standard. encodeURI and encodeURIComponent first encode non-ASCII characters as UTF-8 and then percent-encode the bytes, which is what the RFC recommends. These two functions should therefore be preferred over escape.
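A short comparison illustrates the difference. The sample character "中" (U+4E2D) is chosen for this example only.

```javascript
const sample = "中"; // U+4E2D

// escape emits the nonstandard %uXXXX form.
console.log(escape(sample));             // %u4E2D

// The other two functions percent-encode the UTF-8 bytes E4 B8 AD.
console.log(encodeURI(sample));          // %E4%B8%AD
console.log(encodeURIComponent(sample)); // %E4%B8%AD

// For an ASCII character such as a space, all three agree.
console.log(escape(" "), encodeURI(" "), encodeURIComponent(" ")); // %20 %20 %20
```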
Different use cases: encodeURI is meant for encoding a complete URI, while encodeURIComponent is meant for encoding a single component of a URI. The safe-character lists above show that encodeURIComponent encodes a larger range of characters than encodeURI. As mentioned earlier, reserved characters are generally used to separate URI components (a URI can be split into several components; see the Prerequisites section) or sub-components (such as the delimiters between query parameters): for example, : separates the scheme from the host, and ? separates the path from the query. Because encodeURI operates on a complete URI, in which these characters have a special purpose, it leaves the reserved characters unencoded; otherwise the meaning of the URI would change.
A component has its own internal data format, but its data must not contain the reserved characters that delimit components; otherwise the division of the whole URI into components would be corrupted. This is why encodeURIComponent, which is applied to a single component, needs to encode more characters.
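The following sketch shows the two functions side by side. The host example.com and the parameter names q and next are placeholders invented for the example.

```javascript
const fullUrl = "http://example.com/a path/?q=中文&next=/home";

// encodeURI keeps the delimiters : / ? & = so the URI structure survives.
console.log(encodeURI(fullUrl));
// http://example.com/a%20path/?q=%E4%B8%AD%E6%96%87&next=/home

// encodeURIComponent escapes the delimiters as well. That is what a single
// value needs, but it would destroy a complete URI.
const target = "/home?x=1&y=2";
console.log(encodeURIComponent(target));
// %2Fhome%3Fx%3D1%26y%3D2

// Typical use: encode each value on its own, then assemble the URI.
const assembled = "http://example.com/login?next=" + encodeURIComponent(target);
console.log(assembled);
// http://example.com/login?next=%2Fhome%3Fx%3D1%26y%3D2
```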
Form submission
When an HTML form is submitted, each form field is URL-encoded before being sent. For historical reasons, the URL encoding used by forms does not follow the latest standard; for example, a space is encoded as "+" rather than "%20". If the form is submitted with the POST method, the HTTP request carries a Content-Type header whose value is application/x-www-form-urlencoded. Most applications can handle this nonstandard encoding. In client-side JavaScript, however, none of the decoding functions described above turns "+" into a space, so you have to write that conversion yourself. In addition, for non-ASCII characters, the character set used for encoding depends on the character set of the current document. For example, suppose we add the following to the HTML head:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
The browser will then use gb2312 to render the document. (Note: when this meta tag is absent, the browser selects a character set automatically according to the user's preferences; the user can also force the browser to use a specific character set for the current site.) When the form is submitted, the character set used for URL encoding is gb2312.
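The "+" conversion mentioned above can be sketched as follows. Both helper names are hypothetical, and the sketch assumes the form data was encoded as UTF-8; it does not handle other character sets such as gb2312.

```javascript
// Hypothetical helper: decode one application/x-www-form-urlencoded
// key or value. The + is replaced before percent-decoding, so that a
// literal plus sign (sent as %2B) is not turned into a space.
function decodeFormComponent(text) {
  return decodeURIComponent(text.replace(/\+/g, " "));
}

// Hypothetical counterpart for encoding: spaces become + instead of %20.
function encodeFormComponent(text) {
  return encodeURIComponent(text).replace(/%20/g, "+");
}

console.log(decodeURIComponent("a+b%2Bc"));  // a+b+c  (the + is left alone)
console.log(decodeFormComponent("a+b%2Bc")); // a b+c
console.log(encodeFormComponent("a b+c"));   // a+b%2Bc
```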
I ran into a very confusing problem while using Aptana (why I mention Aptana specifically will become clear below): when I used encodeURI, the encoded result was very different from what I expected. Here is my example code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
</head>
<body>
<script type="text/javascript">
// Write the URL-encoded form of the string into the page.
document.write(encodeURI("中文"));
</script>
</body>
</html>
The output is %E6%B6%93%EE%85%9F%E6%9E%83. This is clearly not the result of URL-encoding "中文" with UTF-8 (search for "中文" on Google and the URL shows %E4%B8%AD%E6%96%87).
At first I suspected that encodeURI depended on the page encoding after all, but encoding "中文" with gb2312 would not give this result either. I finally discovered that the problem was a mismatch between the character set in which the page file was stored and the character set declared in the meta tag. Aptana's editor uses UTF-8 by default, so the file was actually stored as UTF-8. Because the meta tag declares gb2312, however, the browser parses the document as gb2312, and the string "中文" is naturally misread: its UTF-8 bytes are 0xE4 0xB8 0xAD 0xE6 0x96 0x87, and when the browser decodes those six bytes as gb2312 it obtains three different characters, displayed as "涓枃" with an unprintable character in the middle (one Chinese character occupies two bytes in GBK). Passing these three characters to encodeURI yields %E6%B6%93%EE%85%9F%E6%9E%83. In other words, encodeURI always uses UTF-8 and is not affected by the page character set.
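The unexpected output can be inspected by decoding it as UTF-8 and listing the code points it contains. This is only a sketch for checking the explanation above; the variable names are made up, and the script itself must be saved as UTF-8 for the "中文" literal to be read correctly.

```javascript
// What encodeURI produces when the source string really is "中文".
const expected = encodeURI("中文");
console.log(expected); // %E4%B8%AD%E6%96%87

// The unexpected output reported above.
const observed = "%E6%B6%93%EE%85%9F%E6%9E%83";

// Decode it as UTF-8 and list the code points: three characters, the
// middle one in the Unicode private use area.
const codePoints = Array.from(decodeURIComponent(observed), (c) =>
  "U+" + c.codePointAt(0).toString(16).toUpperCase()
);
console.log(codePoints); // [ 'U+6D93', 'U+E15F', 'U+6783' ]
```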
Different browsers handle URLs containing Chinese characters differently. In IE, for example, if the advanced setting "Always send URLs as UTF-8" is checked, Chinese characters in the path portion of the URL are URL-encoded with UTF-8 before being sent to the server, while Chinese characters in the query parameters are URL-encoded with the system default character set. To ensure maximum interoperability, every component placed in a URL should be URL-encoded with an explicitly chosen character set rather than relying on the browser's default behavior.
In addition, many HTTP monitoring tools, as well as the browser address bar, automatically decode the URL once (using UTF-8) when displaying it. This is why the address bar shows Chinese characters when you search for Chinese text on Google in Firefox, even though the URL actually sent to the server is still encoded. You can see the real URL by reading location.href with JavaScript. Do not be misled by this when studying URL encoding and decoding.
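The difference between the displayed form and the transmitted form can be reproduced outside the address bar with the standard URL class, which is available in modern browsers and in Node.js. The address http://example.com/search is a placeholder chosen for this sketch.

```javascript
// Parse a URL whose query contains unencoded Chinese characters.
const url = new URL("http://example.com/search?q=中文");

// The serialized form, as it would be sent: the query is percent-encoded
// using UTF-8.
console.log(url.href); // http://example.com/search?q=%E4%B8%AD%E6%96%87

// The form a tool might display: the same URL decoded once for readability.
console.log(decodeURI(url.href)); // http://example.com/search?q=中文
```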