URL encoding and decoding: practical skills

Source: Internet
Author: User
Tags: control characters, html form, html header, printable characters, reserved, rfc
URLs usually need to be encoded because some characters would otherwise be ambiguous. For example, a URL query string passes arguments as key=value pairs separated by the & symbol, as in /s?q=abc&ie=utf-8. If a value itself contains = or &, the server that receives the URL is bound to parse it incorrectly, so the ambiguous & and = symbols must be escaped, that is, encoded.
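
The ambiguity described above can be reproduced with a small sketch. The parseQuery helper below is hypothetical (it is not part of any library) and simply splits on & and =, the way a naive server-side parser might; encodeURIComponent is the built-in function discussed later in this article.

```javascript
// Hypothetical helper: naive query-string parsing that splits on & and =.
function parseQuery(query) {
  const result = {};
  for (const pair of query.split("&")) {
    const [key, value = ""] = pair.split("=");
    result[decodeURIComponent(key)] = decodeURIComponent(value);
  }
  return result;
}

const rawValue = "a&b=c"; // a single value that happens to contain & and =

// Unencoded: the value is split apart, so q becomes "a" and a stray key b appears.
const broken = parseQuery("q=" + rawValue);

// Encoded: & becomes %26 and = becomes %3D, so the value survives intact.
const fixed = parseQuery("q=" + encodeURIComponent(rawValue));

console.log(broken);
console.log(fixed);
```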

Also, a URL is encoded in ASCII, not Unicode, which means a URL cannot directly contain non-ASCII characters such as Chinese. Otherwise, if the client browser and the server support different character sets, the Chinese characters may cause problems.

The principle of URL encoding is to represent unsafe characters using safe ones, that is, printable characters that have no special purpose or meaning.

Preliminary knowledge: URI stands for Uniform Resource Identifier, and the URLs we usually talk about are just one kind of URI. The format of a typical URL is shown below, using the example given in RFC 3986. The "URL encoding" discussed in the rest of this article should really be called URI encoding.

  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment
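
As a rough illustration of these five components, the built-in URL class (available in modern browsers and in Node.js) can split a URL apart. The URL string below is an illustrative assumption, and the parts object is just a convenient way to print the result.

```javascript
// Illustrative URL (an assumption made for this sketch).
const url = new URL("https://example.com:8042/over/there?name=ferret#nose");

// Map the URL object's properties onto the five component names.
const parts = {
  scheme: url.protocol.replace(/:$/, ""),  // "https:" without the trailing colon
  authority: url.host,                     // host plus port
  path: url.pathname,
  query: url.search.replace(/^\?/, ""),    // drop the leading "?"
  fragment: url.hash.replace(/^#/, ""),    // drop the leading "#"
};

console.log(parts);
```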

Which characters need to be encoded

RFC 3986 stipulates that only English letters (a-zA-Z), digits (0-9), the four special characters - _ . ~, and the reserved characters are allowed in a URL. RFC 3986 gives detailed recommendations on URL encoding and decoding: it points out which characters must be encoded so that the semantics of the URL do not change, and explains why they must be encoded.
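
A minimal sketch of what this means for a data value: every character outside the letters, digits and - _ . ~ is percent-encoded. The function name encodeStrict is hypothetical. It builds on the built-in encodeURIComponent, which leaves ! ' ( ) * unencoded, and escapes those five characters as well.

```javascript
// Hypothetical helper: leave only a-z A-Z 0-9 - _ . ~ unencoded.
function encodeStrict(value) {
  return encodeURIComponent(value).replace(
    /[!'()*]/g,
    // Turn each leftover character into %XX using its ASCII code.
    (ch) => "%" + ch.charCodeAt(0).toString(16).toUpperCase()
  );
}

const strictResult = encodeStrict("a!b'(c)*~ -_.");
console.log(strictResult);
```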

Characters with no printable representation in US-ASCII: only printable characters are allowed in a URL. In US-ASCII, the bytes 0x00-0x1F and 0x7F represent control characters, which cannot appear directly in a URL. Bytes 0x80-0xFF (as used in ISO-8859-1) cannot be placed in a URL either, because they fall outside the byte range defined by US-ASCII.

Reserved characters: a URL can be divided into several components, such as the protocol, host, and path. Some characters (:/?#[]@) are used to separate these components. For example, the colon separates the protocol from the host, / separates the host from the path, ? separates the path from the query parameters, and so on. Other characters (!$&'()*+,;=) act as delimiters inside a component: for example, = joins the key and value of a query parameter, and & separates multiple key-value pairs in the query. When ordinary data inside a component contains these special characters, they need to be encoded.

The following characters are specified in RFC 3986 as reserved characters: ! * ' ( ) ; : @ & = + $ , / ? # [ ]
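
The character classes described so far can be sketched as a small lookup. The function name classify and the category labels are assumptions made for this illustration; the character sets themselves are the ones listed above.

```javascript
const UNRESERVED = /^[A-Za-z0-9\-._~]$/;  // letters, digits, - . _ ~
const COMPONENT_DELIMS = ":/?#[]@";       // separate the components of a URL
const SUB_DELIMS = "!$&'()*+,;=";         // delimiters inside a component

// Hypothetical helper: classify one character.
function classify(ch) {
  if ([...ch].length !== 1) throw new RangeError("expected a single character");
  if (UNRESERVED.test(ch)) return "unreserved";
  if (COMPONENT_DELIMS.includes(ch)) return "reserved (component delimiter)";
  if (SUB_DELIMS.includes(ch)) return "reserved (sub-delimiter)";
  return "other (must be percent-encoded)";
}

for (const ch of ["a", "~", "?", "&", " ", "%"]) {
  console.log(JSON.stringify(ch), classify(ch));
}
```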

Unsafe characters: some other characters may confuse a parser when they are placed directly in a URL. They are considered unsafe for a number of reasons.

Spaces: while a URL is being transmitted, typeset by a user, or handled by a text-processing program, insignificant spaces may be introduced or meaningful spaces may be removed.
Quotation marks and <>: quotation marks and angle brackets are commonly used to delimit URLs in plain text.
#: usually used to indicate a bookmark or anchor.
%: the percent sign is itself the special character used to encode unsafe characters, so it must be encoded too.
{}|\^[]`~: some gateways or transport agents tamper with these characters.

It is worth noting that for the legal characters in a URL, the encoded and unencoded forms are equivalent, but for the characters described above, leaving them unencoded may change the semantics of the URL. Therefore, only ordinary English letters and digits, the special characters $-_.+!*'(), and the reserved characters may appear in a URL without encoding. All other characters must be encoded before they can appear in a URL.

However, for historical reasons there are still some non-standard encoding implementations. For example, although RFC 3986 states that the tilde (~) does not need to be URL-encoded, many old gateways or transport agents still encode it.

How to encode an illegal character in a URL

URL encoding is also often called percent-encoding, because the scheme is very simple: a percent sign (%) followed by two hexadecimal digits (0123456789ABCDEF) represents one byte. The default character set for URL encoding is US-ASCII. For example, a corresponds to the byte 0x61 in US-ASCII, so its URL encoding is %61; a search query typed into the address bar as %61%62%63 is in fact equivalent to searching Google for abc. Likewise, the @ symbol corresponds to the byte 0x40 in the ASCII character set, so its URL encoding is %40.
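
The byte-to-%XX rule can be sketched in a few lines. The function name percentEncodeByte is hypothetical; it only formats a byte value and does not decide whether a character needs encoding.

```javascript
// Hypothetical helper: format one byte (0-255) as a percent sign plus two hex digits.
function percentEncodeByte(byte) {
  if (!Number.isInteger(byte) || byte < 0 || byte > 0xff) {
    throw new RangeError("expected an integer between 0 and 255");
  }
  return "%" + byte.toString(16).toUpperCase().padStart(2, "0");
}

console.log(percentEncodeByte("a".charCodeAt(0))); // %61
console.log(percentEncodeByte("@".charCodeAt(0))); // %40

// Decoding goes the other way: %61%62%63 is just another spelling of abc.
console.log(decodeURIComponent("%61%62%63")); // abc
```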

For non-ASCII characters, a superset of the ASCII character set is needed to obtain the corresponding bytes, and each byte is then percent-encoded. For Unicode characters, the RFC recommends encoding them as UTF-8 and then percent-encoding each byte. For example, the string "中文" ("Chinese") is encoded in UTF-8 as the bytes 0xe4 0xb8 0xAD 0xe6 0x96 0x87, so its URL encoding is "%e4%b8%ad%e6%96%87".
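
The two steps (convert to UTF-8 bytes, then percent-encode each byte) can be sketched with the built-in TextEncoder. The function name percentEncodeUtf8 is hypothetical, and unlike a real URL encoder it encodes every byte, including plain ASCII letters.

```javascript
// Hypothetical helper: percent-encode every UTF-8 byte of a string.
function percentEncodeUtf8(text) {
  const bytes = new TextEncoder().encode(text); // TextEncoder always produces UTF-8
  return Array.from(
    bytes,
    (b) => "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

const encodedChinese = percentEncodeUtf8("中文");
console.log(encodedChinese); // %E4%B8%AD%E6%96%87

// The built-in encodeURIComponent gives the same result for this string.
console.log(encodeURIComponent("中文") === encodedChinese); // true
```

The hexadecimal digits come out in upper case here; percent-encoded hex digits are case-insensitive, so %e4 and %E4 denote the same byte.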

If a byte corresponds to a reserved character in the ASCII character set, this byte does not need to be represented by a percent sign. For example, URL encoding, the byte encoded using UTF-8 encoding is 0x55 0x72 0x6c 0xe7 0xBC 0x96 0xe7 0xa0, because the first three bytes correspond to the non-reserved character "url" in ASCII, so these three bytes can be "url" with a non reserved character Said. The final URL code can be simplified to "url%e7%bc%96%e7%a0%81", of course, if you use "%55%72%6c%e7%bc%96%e7%a0%81" is also possible.
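
A quick check of this equivalence, using only the built-in decodeURIComponent and encodeURIComponent; the variable names are illustrative.

```javascript
const shortForm = "Url%E7%BC%96%E7%A0%81";       // unreserved characters left as-is
const longForm = "%55%72%6C%E7%BC%96%E7%A0%81";  // every byte percent-encoded

// Both spellings decode to the same string.
const decodedShort = decodeURIComponent(shortForm);
const decodedLong = decodeURIComponent(longForm);
console.log(decodedShort === decodedLong); // true

// The built-in encoder produces the short form.
console.log(encodeURIComponent(decodedShort) === shortForm); // true
```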

For historical reasons, there are some URL coding implementations that do not fully follow this principle, as will be mentioned below.

The difference between escape, encodeURI and encodeURIComponent in JavaScript

JavaScript provides three pairs of functions for encoding a URL into a valid form: escape/unescape, encodeURI/decodeURI, and encodeURIComponent/decodeURIComponent. Because decoding is simply the reverse of encoding, only the encoding process is explained here.

The three encoding functions (escape, encodeURI, and encodeURIComponent) all convert unsafe or illegal URL characters into legal ones, but they differ in several ways.

Different safe characters:

The safe characters of the three functions, that is, the characters each function does not encode, are listed below, with the number of safe characters in parentheses.

escape (69): */@+-._0-9a-zA-Z
encodeURI (82): !#$&'()*+,/:;=?@-._~0-9a-zA-Z
encodeURIComponent (71): !'()*-._~0-9a-zA-Z
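
These counts can be checked with a short sketch that feeds every printable ASCII character to each function and keeps the ones that come back unchanged. The helper name safeChars is hypothetical; escape is a legacy global that is still available in browsers and in Node.js.

```javascript
// Hypothetical helper: collect the printable ASCII characters an encoder leaves untouched.
function safeChars(encode) {
  let safe = "";
  for (let code = 0x20; code < 0x7f; code++) {
    const ch = String.fromCharCode(code);
    if (encode(ch) === ch) safe += ch;
  }
  return safe;
}

const escapeSafe = safeChars(escape);
const encodeURISafe = safeChars(encodeURI);
const componentSafe = safeChars(encodeURIComponent);

console.log(escapeSafe.length, escapeSafe);       // 69 characters
console.log(encodeURISafe.length, encodeURISafe); // 82 characters
console.log(componentSafe.length, componentSafe); // 71 characters
```
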

Compatibility difference: the escape function dates from JavaScript 1.0, while the other two functions were introduced in JavaScript 1.5. Since JavaScript 1.5 is already very widespread, encodeURI and encodeURIComponent cause no compatibility problems in practice.

Different encoding of Unicode characters: all three functions encode ASCII characters in the same way, as a percent sign plus two hexadecimal digits. For Unicode characters above U+00FF, however, escape produces %uXXXX, where XXXX is four hexadecimal digits representing the character. This approach has been deprecated by the W3C, although the ECMA-262 standard still retains this syntax for escape. encodeURI and encodeURIComponent encode non-ASCII characters as UTF-8 and then percent-encode the bytes, which is what the RFC recommends. It is therefore advisable to use these two functions instead of escape wherever possible.
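
The difference is easy to see on the string "中文" ("Chinese"). This sketch uses only the built-in functions; the variable names are illustrative.

```javascript
const text = "中文";

const legacy = escape(text);               // one %uXXXX per UTF-16 code unit
const utf8 = encodeURIComponent(text);     // UTF-8 bytes, each percent-encoded

console.log(legacy); // %u4E2D%u6587
console.log(utf8);   // %E4%B8%AD%E6%96%87

// The two formats are not interchangeable: decodeURIComponent rejects %uXXXX.
let rejected = false;
try {
  decodeURIComponent(legacy);
} catch (err) {
  rejected = err instanceof URIError;
}
console.log(rejected); // true
```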

Different applications: encodeURI is used to encode a complete URI, while encodeURIComponent is used to encode a single component of a URI. From the table of safe characters above, we can see that encodeURIComponent encodes a larger range of characters than encodeURI. As mentioned earlier, reserved characters are generally used to separate URI components (a URI can be split into several components; see the preliminary knowledge section) or subcomponents (such as the delimiters of query parameters): for example, : separates the scheme from the host, and ? separates the path from the query. Because encodeURI operates on a complete URI, in which these characters have special uses, it does not encode the reserved characters; otherwise the meaning of the URI would change.

A component has its own internal data format, but its data must not contain the reserved characters that separate components, or the division of the whole URI into components would become confused. That is why encodeURIComponent, which is meant for a single component, has to encode more characters.
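
A small sketch of the practical consequence, assuming an illustrative base URL and a query value that contains reserved characters:

```javascript
const base = "https://example.com/search"; // illustrative URL, an assumption
const value = "a&b=c/d?e";                 // data that contains reserved characters

// encodeURI treats its input as a whole URI, so the reserved characters
// inside the value are left alone and the query stays ambiguous.
const whole = encodeURI(base + "?q=" + value);

// encodeURIComponent is applied to the value only, so its reserved
// characters are escaped before the URI is assembled.
const assembled = base + "?q=" + encodeURIComponent(value);

console.log(whole);     // https://example.com/search?q=a&b=c/d?e
console.log(assembled); // https://example.com/search?q=a%26b%3Dc%2Fd%3Fe
```

The usual pattern is therefore to encode each value with encodeURIComponent first and only then join the pieces with the delimiters.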

Form submission

When an HTML form is submitted, each form field is URL-encoded before being sent. For historical reasons, the URL encoding used by forms does not conform to the latest standard. For example, a space is encoded not as %20 but as +. If the form is submitted with the POST method, the HTTP request carries a Content-Type header whose value is application/x-www-form-urlencoded. Most applications can handle this non-standard URL encoding, but none of the client-side decoding functions described above (unescape, decodeURI, decodeURIComponent) turns + back into a space, so you have to write the conversion yourself.

Also, for non-ASCII characters, the character set used for the encoding depends on the character set of the current document. For example, suppose we add the following to the HTML header:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

The browser then renders the document as GB2312 (note that when an HTML document does not set this meta tag, the browser chooses a character set automatically based on the current user's preferences, and the user can also force the current site to use a specific character set). When the form is submitted, the URL encoding uses the GB2312 character set.
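
The hand-written conversion for the + convention can be sketched as follows. The function name decodeFormValue is hypothetical, and the sketch assumes the form data was encoded as UTF-8, since decodeURIComponent cannot decode GB2312 bytes.

```javascript
// Hypothetical helper: decode one application/x-www-form-urlencoded value (UTF-8 assumed).
function decodeFormValue(value) {
  // Replace + with a space first, so that a literal plus sign,
  // which arrives as %2B, is still decoded to "+" afterwards.
  return decodeURIComponent(value.replace(/\+/g, " "));
}

console.log(decodeFormValue("hello+world%21")); // hello world!
console.log(decodeFormValue("1%2B1"));          // 1+1

// In current browsers and Node.js, the built-in URLSearchParams applies the same rule.
console.log(new URLSearchParams("q=hello+world%21").get("q")); // hello world!
```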

I once ran into a very confusing problem while using Aptana (why Aptana in particular is explained below): the result produced by encodeURI was very different from what I expected. Here is my sample code:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
<script type="text/javascript">
// The string literal below is "Chinese" written in Chinese characters.
document.write(encodeURI("中文"));
</script>

Running it outputs %e6%b6%93%ee%85%9f%e6%9e%83. This is obviously not the result of URL encoding with the UTF-8 character set (searching Google for "中文" shows %e4%b8%ad%e6%96%87 in the URL).

So at the time I strongly suspected that encodeURI also depends on the page encoding, but I found that URL-encoding with gb2312 would not normally produce this result either. I eventually discovered that the problem was caused by a mismatch between the character set in which the page file was stored and the character set declared in the meta tag. The Aptana editor uses UTF-8 by default, so the file was actually stored as UTF-8. But because the meta tag declares gb2312, the browser parses the document as gb2312, and the "中文" string naturally comes out wrong: its UTF-8 bytes are 0xe4 0xb8 0xAD 0xe6 0x96 0x87, and when the browser decodes these 6 bytes as gb2312 it gets three different characters (in GBK each Chinese character takes two bytes). Judging from the output, these are 涓, a character in the Unicode private-use area (U+E15F), and 枃. Passing these three characters through encodeURI gives %e6%b6%93%ee%85%9f%e6%9e%83. In other words, encodeURI always uses UTF-8 and is not affected by the page character set.

Browsers behave differently when handling URLs that contain Chinese. In IE, for example, if the advanced setting "Always send URLs as UTF-8" is checked, Chinese in the path part of the URL is sent to the server URL-encoded as UTF-8, while Chinese in the query parameters is URL-encoded with the system default character set. For maximum interoperability, it is recommended that every component placed in a URL be URL-encoded with an explicitly chosen character set, rather than relying on the browser's default behavior.

In addition, many HTTP monitoring tools and browser address bars automatically decode the URL once (using the UTF-8 character set) when displaying it, which is why the address bar shows Chinese characters when you search Google for Chinese text in Firefox. The original URL sent to the server is nevertheless encoded. You can see this by reading location.href with JavaScript from the address bar. Do not be fooled by these illusions when studying URL encoding and decoding.