http://www.imkevinyang.com/2009/08/%E8%AF%A6%E8%A7%A3javascript%E4%B8%AD%E7%9A%84url%E7%BC%96%E8%A7%A3%E7%A0%81.html
Summary
This article introduces the problems around URI encoding and decoding: which characters in a URL need to be encoded, why they need to be encoded, and a comparison of the three pairs of JavaScript encoding/decoding functions: escape/unescape, encodeURI/decodeURI, and encodeURIComponent/decodeURIComponent.
Pre-knowledge
  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment
URI stands for Uniform Resource Identifier; the URL we usually speak of is just one kind of URI. The format of a typical URL is shown above. The "URL encoding" discussed below really refers to URI encoding.
Why URL encoding is required
Usually, if something needs to be encoded, it means it is not suitable for transmission as-is. The reasons vary: the data may be too large, or it may contain private information. For URLs, the reason for encoding is that some characters in a URL cause ambiguity.
For example, a URL query string carries key=value pairs separated by the & symbol, as in /s?q=abc&ie=utf-8. If a value itself contains = or &, the server will parse the URL incorrectly, so the ambiguous & and = symbols must be escaped, that is, encoded.
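As a minimal sketch of this point (the value and URL here are illustrative, not from a real service), encodeURIComponent makes a value containing & or = safe to embed in a query string:

```javascript
// A value containing "&" or "=" must be encoded before being placed in a
// query string, or the server will split the key=value pairs incorrectly.
var value = "a=b&c";                    // raw value with ambiguous characters
var safe = encodeURIComponent(value);   // "a%3Db%26c"
var url = "/s?q=" + safe + "&ie=utf-8"; // unambiguous query string
```

After encoding, the only unescaped = and & left in the query string are the real delimiters.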
Another example: a URL is encoded in ASCII rather than Unicode, which means a URL cannot directly contain non-ASCII characters such as Chinese. Otherwise, Chinese characters can cause problems when the client browser and the server support different character sets.
The principle of URL encoding is to use safe characters (printable characters with no special purpose or special meaning) to represent unsafe characters.
Which characters need to be encoded
The RFC 3986 document stipulates that only letters (a-z, A-Z), digits (0-9), the 4 special characters - _ . ~, and the reserved characters may appear in a URL unencoded.
RFC 3986 makes detailed recommendations on URL encoding and decoding, indicating which characters must be encoded so that the URL's semantics do not change, and explaining why these characters need to be encoded.
Non-printable characters in the US-ASCII character set
Only printable characters are allowed in a URL. The control characters in US-ASCII (bytes 0x00-0x1F and 0x7F) cannot appear directly in a URL. Bytes 0x80-0xFF (e.g. ISO-8859-1) fall outside the range defined by US-ASCII and therefore cannot be placed in a URL either.
Reserved characters
A URL is divided into several components: scheme, host, path, and so on. Some characters (:/?#[]@) separate the different components: the colon separates the scheme from the host, / separates the host from the path, ? separates the path from the query parameters, and so on. Other characters (!$&'()*+,;=) delimit data within each component: for example, = represents key-value pairs in query parameters, and & separates multiple key-value pairs. When ordinary data inside a component contains these special characters, it must be encoded.
The following characters are reserved characters in RFC 3986:

! * ' ( ) ; : @ & = + $ , / ? # [ ]
Unsafe characters
There are also some characters that may cause ambiguity in the parser when they are placed directly in the URL. These characters are considered unsafe characters for a number of reasons.
Character | Reason it is unsafe
Space | In transmission, typesetting, or processing by text handlers, insignificant spaces may be introduced or meaningful spaces removed
" and <> | Quotation marks and angle brackets are commonly used to delimit URLs in plain text
# | Typically used to mark bookmarks or anchor points
% | The percent sign itself is used to encode unsafe characters, so it must itself be encoded
{}|\^[]`~ | Some gateways or transport agents tamper with these characters
Note that for legal characters in URLs, the encoded and unencoded forms are equivalent; but for the characters listed above, leaving them unencoded may change the URL's semantics. Therefore, only ordinary letters and digits, the special characters $-_.+!*'(), and the reserved characters may appear unencoded in a URL. All other characters must be encoded before they appear in a URL.
However, for historical reasons, some non-standard encoding implementations still exist. For example, although RFC 3986 stipulates that the tilde ~ does not require URL encoding, many old gateways and transport agents still encode it.
How to encode an illegal character in a URL
URL encoding is also commonly called percent-encoding because its mechanism is very simple: a percent sign % followed by two hexadecimal digits (0-9, a-f) representing the value of a byte. The default character set used by URL encoding is US-ASCII. For example, the letter a corresponds to byte 0x61 in US-ASCII, so its URL encoding is %61; entering http://g.cn/search?q=%61%62%63 in the address bar is equivalent to searching for abc on Google. Similarly, the @ symbol corresponds to byte 0x40 in ASCII, and after URL encoding it becomes %40.
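The two examples above can be checked directly in JavaScript (a small sketch; note that the built-ins emit uppercase hex digits):

```javascript
// Percent-encoding replaces a byte with "%" plus its two hex digits.
var abc = decodeURIComponent("%61%62%63"); // bytes 0x61 0x62 0x63 -> "abc"
var at = encodeURIComponent("@");          // "@" is byte 0x40 -> "%40"
```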
List of URL encodings for common characters:
URL encoding of reserved characters:

Character | URL encoding
! | %21
* | %2a
" | %22
' | %27
( | %28
) | %29
; | %3b
: | %3a
@ | %40
& | %26
= | %3d
+ | %2b
$ | %24
, | %2c
/ | %2f
? | %3f
% | %25
# | %23
[ | %5b
] | %5d
For non-ASCII characters, a superset of the ASCII character set is needed to obtain the corresponding bytes, and then percent-encoding is performed on each byte. For Unicode characters, the RFC recommends encoding them to UTF-8 bytes first, then percent-encoding each byte. For example, "中文" ("Chinese") encoded in UTF-8 yields the bytes 0xE4 0xB8 0xAD 0xE6 0x96 0x87, which URL-encode to "%e4%b8%ad%e6%96%87".

If a byte corresponds to an unreserved character in the ASCII character set, that byte need not be percent-encoded. For example, "Url编码" ("URL encoding") encoded in UTF-8 gives the bytes 0x55 0x72 0x6C 0xE7 0xBC 0x96 0xE7 0xA0 0x81; since the first three bytes correspond to the unreserved characters "Url" in ASCII, those three bytes can be written literally. The final URL encoding can thus be simplified to "Url%e7%bc%96%e7%a0%81", although the fully encoded "%55%72%6c%e7%bc%96%e7%a0%81" is equally valid.
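Both behaviors can be observed with the built-in functions (a small sketch; the built-ins emit uppercase hex, which decodes identically to the lowercase forms above):

```javascript
// Non-ASCII characters are first converted to UTF-8 bytes, then each byte
// is percent-encoded; unreserved ASCII characters may stay literal.
var zh = encodeURIComponent("中文"); // "%E4%B8%AD%E6%96%87"
var mixed = encodeURI("url编码");    // "url%E7%BC%96%E7%A0%81"
```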
For historical reasons, there are some URL encoding implementations that do not fully follow this principle, as mentioned below.
The difference between escape, encodeURI, and encodeURIComponent in JavaScript
JavaScript provides 3 pairs of functions for encoding a URL to obtain a valid URL: escape/unescape, encodeURI/decodeURI, and encodeURIComponent/decodeURIComponent. Since decoding is simply the reverse of encoding, only the encoding side is explained here.
The three encoding functions escape, encodeURI, and encodeURIComponent all convert unsafe or illegal URL characters into legal URL characters, but they differ in several ways.
Different safe characters
The following table lists the safe characters for each function (that is, the characters the function does not encode):

Function (number of safe characters) | Safe characters
escape (69) | */@+-._0-9a-zA-Z
encodeURI (82) | !#$&'()*+,/:;=?@-._~0-9a-zA-Z
encodeURIComponent (71) | !'()*-._~0-9a-zA-Z
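The difference in safe-character sets is easy to see by running the three functions on one string of reserved characters (an illustrative string, not from the original article):

```javascript
// How the three functions treat the same string of reserved characters.
var s = "a/b?c=d&e";
var e1 = escape(s);             // "/" is safe for escape; "?", "=", "&" are not
var e2 = encodeURI(s);          // all reserved characters are left intact
var e3 = encodeURIComponent(s); // every reserved character is encoded
```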
Different compatibility
The escape function has existed since JavaScript 1.0; the other two functions were introduced in JavaScript 1.5. Since JavaScript 1.5 is now essentially universal, encodeURI and encodeURIComponent pose no compatibility problems in practice.
Unicode characters are encoded differently
All three functions encode ASCII characters the same way: a percent sign plus two hexadecimal digits. For Unicode characters, however, escape uses the form %uxxxx, where xxxx is the character's 4-digit hexadecimal code point. This approach has been deprecated by the W3C, although the syntax is still preserved in the ECMA-262 standard. encodeURI and encodeURIComponent encode non-ASCII characters to UTF-8 bytes before percent-encoding them, which is what the RFC recommends. It is therefore advisable to use these two functions instead of escape whenever possible.
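A quick sketch of the two encodings side by side:

```javascript
// escape uses the deprecated %uXXXX form for Unicode characters;
// encodeURI percent-encodes the UTF-8 bytes instead.
var legacy = escape("中文");    // "%u4E2D%u6587"
var modern = encodeURI("中文"); // "%E4%B8%AD%E6%96%87"
```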
Suitable for different occasions
encodeURI is used to encode a complete URI, while encodeURIComponent is used to encode a single URI component.
From the safe-character ranges listed above, encodeURIComponent encodes a larger range of characters than encodeURI. As mentioned earlier, reserved characters are used to separate URI components (a URI can be cut into multiple components; refer to the pre-knowledge section) or subcomponents (such as the delimiters between query parameters): the colon separates the scheme from the host, and / separates the host from the path. Since the object encodeURI operates on is a complete URI, these characters have a legitimate special meaning there, so encodeURI leaves the reserved characters unencoded; encoding them would change the URI's meaning.
A component has its own internal data format, but that data must not contain the reserved characters that delimit components, or the separation of components within the whole URI becomes ambiguous. Therefore encodeURIComponent, which operates on a single component, must encode more characters.
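This division of labor can be sketched with a small helper for building query strings (buildQuery is a hypothetical name, not a built-in): each key and value is an isolated component, so encodeURIComponent is the right tool, while the delimiters = and & are added unencoded.

```javascript
// Hypothetical helper: encode each key and value as a URI component,
// leaving the "=" and "&" delimiters to structure the query string.
function buildQuery(params) {
  var parts = [];
  for (var key in params) {
    parts.push(encodeURIComponent(key) + "=" + encodeURIComponent(params[key]));
  }
  return parts.join("&");
}

var query = buildQuery({ q: "a&b", ie: "utf-8" }); // "q=a%26b&ie=utf-8"
```

Had encodeURI been used on the values here, the & inside "a&b" would have survived unencoded and corrupted the pair structure.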
Form submission
When an HTML form is submitted, each form field is URL-encoded before being sent. For historical reasons, the URL encoding used by forms does not follow the latest standard: for example, a space is encoded not as %20 but as +. If the form is submitted with the POST method, you can see a Content-Type header in the HTTP request with the value application/x-www-form-urlencoded. Most server-side applications can handle this non-standard URL encoding, but client-side JavaScript has no built-in function that decodes the + back into a space; you have to write that conversion yourself. Also, for non-ASCII characters, the character set used for encoding depends on the character set of the current document. For example, suppose we add this HTML header:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
The browser will then render the document using GB2312. (Note that when no META tag is set in the HTML document, the browser chooses a character set automatically based on the current user's preferences, and the user can also force the current site to use a specified character set.) When a form is submitted, the URL encoding uses GB2312.
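Because of the non-standard + convention, decoding a form-encoded value takes one extra step beyond decodeURIComponent. A minimal sketch (decodeFormValue is a hypothetical helper name, not part of any built-in API):

```javascript
// Hypothetical helper: decode an application/x-www-form-urlencoded value,
// where "+" means a space (decodeURIComponent alone does not handle this).
function decodeFormValue(s) {
  return decodeURIComponent(s.replace(/\+/g, " "));
}

var decoded = decodeFormValue("a+b%26c"); // "a b&c"
```

The + is replaced before percent-decoding so that a literal plus sign, which a form encodes as %2b, is still decoded correctly.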
Does the document character set affect encodeURI?
While using Aptana (why Aptana specifically will become clear below), I ran into a very confusing problem: the result of encodeURI was quite different from what I expected. Here is my sample code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
Running it outputs %e6%b6%93%ee%85%9f%e6%9e%83. This is obviously not the result of URL encoding with the UTF-8 character set (search Google for "中文" and you will see %e4%b8%ad%e6%96%87 in the URL).
At first I suspected that encodeURI was also affected by the page encoding, but I found that, under normal circumstances, URL encoding with GB2312 does not produce this result either. Eventually I discovered the cause: the character set used to store the page file was inconsistent with the character set declared in the META tag. The Aptana editor uses the UTF-8 character set by default, so the file was actually stored as UTF-8. But because the META tag declared GB2312, the browser parsed the document as GB2312, and the string "中文" was therefore mangled: its UTF-8 bytes 0xE4 0xB8 0xAD 0xE6 0x96 0x87, re-decoded as GB2312 (where each Chinese character occupies two bytes), yield three different characters, and passing those three characters through encodeURI produces %e6%b6%93%ee%85%9f%e6%9e%83. So encodeURI always uses UTF-8 and is not affected by the page character set.
Other issues related to URL encoding
Different browsers handle URLs containing Chinese differently. For example, in IE, if you check the advanced setting "Always send URLs as UTF-8", the Chinese characters in the path portion of a URL are sent to the server as UTF-8, while the Chinese characters in the query parameters are URL-encoded using the system default character set. To ensure maximum interoperability, it is recommended to explicitly URL-encode every component placed in a URL with a specified character set, rather than relying on the browser's default behavior.
In addition, many HTTP monitoring tools and the browser address bar automatically decode the URL once (using the UTF-8 character set) when displaying it, which is why the address bar shows Chinese characters when you search for Chinese on Google in Firefox. The original URL actually sent to the server is still encoded; you can verify this by reading location.href with JavaScript in the address bar. Do not be fooled by these illusions when investigating URL encoding and decoding.