URL encoding and decoding

Source: Internet
Author: User
Tags: control characters, html form, html header, parse error, printable characters, rfc

In general, data needs to be encoded when it is unsuitable for direct transmission. There are many possible reasons: the data may be too large, or it may contain private information. For URLs, the reason for encoding is that some characters in a URL would cause ambiguity.

For example, the URL query string passes parameters as key=value pairs, and the pairs are separated by the & symbol, as in /s?q=abc&ie=utf-8.

If a value string itself contains = or &, the server receiving the URL is bound to parse it incorrectly. Therefore, the ambiguous & and = symbols must be escaped, that is, encoded.
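As a quick illustration (using JavaScript's encodeURIComponent, discussed later in this article), encoding a value that happens to contain = and & keeps the query string unambiguous:

```javascript
// A value containing the reserved characters "=" and "&" must be
// percent-encoded before it is placed in a query string, otherwise
// the server cannot tell delimiters from data.
const value = "a=b&c";                     // raw value containing delimiters
const encoded = encodeURIComponent(value); // "a%3Db%26c"
const url = "/s?q=" + encoded + "&ie=utf-8";

console.log(url); // /s?q=a%3Db%26c&ie=utf-8
```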

Also, the URL encoding format uses ASCII, not Unicode, which means a URL cannot directly contain any non-ASCII characters, such as Chinese.

Otherwise, if the client browser and the server support different character sets, Chinese characters may cause problems.

The principle of URL encoding is to use safe characters (printable characters with no special purpose or special meaning) to represent unsafe characters.

Background: URI stands for Uniform Resource Identifier; the URL we usually speak of is just one kind of URI.

The format of a typical URI is shown in the following example. The URL encoding discussed below actually refers to URI encoding.

         foo://example.com:8042/over/there?name=ferret#nose
         \_/   \______________/\_________/ \_________/ \__/
          |           |            |            |        |
       scheme     authority       path        query   fragment
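In modern browsers and Node.js, these components can be inspected with the WHATWG URL class; a small sketch using the example URI above:

```javascript
// Split the example URI from RFC 3986 into its five components.
const u = new URL("foo://example.com:8042/over/there?name=ferret#nose");

console.log(u.protocol); // "foo:"         -> scheme
console.log(u.hostname); // "example.com"  -> part of the authority
console.log(u.port);     // "8042"         -> part of the authority
console.log(u.pathname); // "/over/there"  -> path
console.log(u.search);   // "?name=ferret" -> query
console.log(u.hash);     // "#nose"        -> fragment
```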

Which characters need to be encoded

RFC 3986 stipulates that a URL may contain only the English letters (a-zA-Z), the digits (0-9), the 4 special characters - _ . ~, and the reserved characters.

The RFC 3986 document makes specific recommendations on the encoding and decoding of URLs, indicating which characters need to be encoded so that the URL's semantics do not change, and explaining why those characters need to be encoded.

Characters with no corresponding printable character in the US-ASCII character set: only printable characters are allowed in a URL.

In US-ASCII, the bytes 00-1F and 7F represent control characters, which cannot appear directly in a URL. The bytes 80-FF (e.g. ISO-8859-1) fall outside the range defined by US-ASCII and therefore cannot appear in a URL either.

Reserved characters: a URL can be divided into several components: scheme, host, path, and so on. Some characters (: / ? # [ ] @) are used to separate the different components. For example, the colon separates the scheme from the host, / separates the host from the path, and ? separates the path from the query parameters.

Other characters (! $ & ' ( ) * + , ; =) are used to delimit subcomponents within each component; for example, = represents key-value pairs in the query parameters, and the & symbol separates multiple key-value pairs. When ordinary data inside a component includes these special characters, it needs to be encoded.

The following characters are reserved characters in RFC 3986: ! * ' ( ) ; : @ & = + $ , / ? # [ ]

Unsafe characters: some other characters, when placed directly in a URL, may cause ambiguity for parsers. These characters are treated as unsafe, for a variety of reasons:

    • Spaces: while a URL is in transit, being typeset by a user, or being handled by a text processor, insignificant spaces may be introduced or meaningful spaces removed.
    • Quotation marks and <>: quotes and angle brackets are often used to delimit URLs in ordinary text.
    • #: often used to mark bookmarks or anchor points.
    • %: the percent sign itself is used as the special character for encoding unsafe characters, so it needs to be encoded as well.
    • {}|\^[]`~: some gateways or transport agents tamper with these characters.

It is important to note that for legal characters in a URL, the encoded and unencoded forms are equivalent. For the characters listed above, however, leaving them unencoded may change the URL's semantics.

Therefore, only ordinary English letters and digits, the special characters $ - _ . + ! * ' ( ), and the reserved characters can appear unencoded in a URL. All other characters must be encoded before they can appear in a URL.

For historical reasons, however, some non-standard encoding implementations still exist. For example, although RFC 3986 states that the tilde ~ does not need URL encoding, many older gateways or transport agents still encode it.

How to encode an illegal character in a URL

URL encoding is also commonly called percent-encoding, because its scheme is very simple: a percent sign % followed by two hexadecimal digits (0123456789ABCDEF) representing one byte.

The default character set used by URL encoding is US-ASCII. For example, the byte corresponding to 'a' in US-ASCII is 0x61, so its URL encoding is %61. Entering http://g.cn/search?q=%61%62%63 in the address bar is actually equivalent to searching for abc on Google. As another example, the byte corresponding to the @ symbol in the ASCII character set is 0x40, which becomes %40 after URL encoding.

For non-ASCII characters, the corresponding bytes are obtained using a superset of the ASCII character set, and then each byte is percent-encoded.

For Unicode characters, the RFC recommends using UTF-8 to obtain the corresponding bytes and then percent-encoding each byte. For example, "中文" ("Chinese") encoded with UTF-8 yields the bytes 0xE4 0xB8 0xAD 0xE6 0x96 0x87, so its URL encoding is "%E4%B8%AD%E6%96%87".

If a byte corresponds to a non-reserved character in the ASCII character set, that byte does not need the percent-sign form. For example, "URL编码" ("URL encoding") encoded with UTF-8 yields the bytes 0x55 0x52 0x4C 0xE7 0xBC 0x96 0xE7 0xA0 0x81. Because the first three bytes correspond to the non-reserved characters "URL" in ASCII, they can be represented by those characters directly, so the URL encoding can be simplified to "URL%E7%BC%96%E7%A0%81". Of course, using "%55%52%4C%E7%BC%96%E7%A0%81" also works.
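The byte-level rule described above can be sketched in a few lines of JavaScript (a minimal illustration, not a full implementation; TextEncoder produces the UTF-8 bytes, and the unreserved set here is the RFC 3986 one: letters, digits, and - . _ ~):

```javascript
// Percent-encode a string: UTF-8-encode it, keep bytes that map to
// unreserved ASCII characters, and replace every other byte with
// "%" plus two hexadecimal digits.
function percentEncode(str) {
  const unreserved = /^[A-Za-z0-9\-._~]$/;
  return Array.from(new TextEncoder().encode(str), (b) => {
    const ch = String.fromCharCode(b);
    return unreserved.test(ch)
      ? ch
      : "%" + b.toString(16).toUpperCase().padStart(2, "0");
  }).join("");
}

console.log(percentEncode("URL编码")); // URL%E7%BC%96%E7%A0%81
console.log(percentEncode("中文"));    // %E4%B8%AD%E6%96%87
```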

For historical reasons, some URL encoding implementations do not fully follow this principle, as mentioned below.

The difference between escape, encodeURI, and encodeURIComponent in JavaScript

JavaScript provides 3 pairs of functions for encoding a URL into a valid form: escape/unescape, encodeURI/decodeURI, and encodeURIComponent/decodeURIComponent.

Because decoding is simply the reverse of encoding, only the encoding process is explained here.

All three encoding functions (escape, encodeURI, encodeURIComponent) convert unsafe, illegal URL characters into legal URL characters, but they differ in several ways.

They treat different characters as safe:

The following lists the safe characters for each of the three functions (the number in parentheses is the size of the set; these are the characters the function does not encode):

    • escape (69): */@+-._0-9a-zA-Z
    • encodeURI (82): !#$&'()*+,/:;=?@-._~0-9a-zA-Z
    • encodeURIComponent (71): !'()*-._~0-9a-zA-Z
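The difference is easy to see by feeding all three functions the same URL containing reserved characters and Chinese text:

```javascript
// The three functions treat the same input differently because
// their safe-character sets differ.
const s = "http://g.cn/s?q=中文&ie=utf-8";

console.log(escape(s));
// http%3A//g.cn/s%3Fq%3D%u4E2D%u6587%26ie%3Dutf-8
console.log(encodeURI(s));
// http://g.cn/s?q=%E4%B8%AD%E6%96%87&ie=utf-8
console.log(encodeURIComponent(s));
// http%3A%2F%2Fg.cn%2Fs%3Fq%3D%E4%B8%AD%E6%96%87%26ie%3Dutf-8
```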

Compatibility: escape has existed since JavaScript 1.0; the other two functions were introduced in JavaScript 1.5. But since JavaScript 1.5 is ubiquitous, there is effectively no compatibility problem with encodeURI and encodeURIComponent.

They encode Unicode characters differently: all three functions encode ASCII characters the same way, using a percent sign plus two hexadecimal digits. For Unicode characters, however, escape uses the form %uXXXX, where XXXX is the character's 4-digit hexadecimal code point. This form has been deprecated, although the syntax of escape is still kept in the ECMA-262 standard. encodeURI and encodeURIComponent encode non-ASCII characters with UTF-8 and then percent-encode the resulting bytes.

This is what the RFC recommends, so it is advisable to use these two functions instead of escape whenever possible.

Their intended uses differ: encodeURI is used to encode a complete URI, while encodeURIComponent is used to encode a single component of a URI. From the safe-character lists above, we can see that encodeURIComponent encodes a larger range of characters than encodeURI.

As mentioned above, reserved characters are used to separate URI components (a URI can be split into multiple components; see the background section) or subcomponents (such as the delimiters between query parameters). For example, the colon separates the scheme from the host, and the slash separates the host from the path. Since the object encodeURI operates on is a complete URI, in which these characters already have special meaning, encodeURI does not encode the reserved characters; otherwise the meaning of the URI would change.

A component has its own internal data format, but that data must not include the reserved characters that separate components, which would confuse the separation of components in the whole URI. Therefore encodeURIComponent, which is meant for a single component, has many more characters to encode.
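A short demonstration of the difference: for data inside a component, reserved characters are just data, and only encodeURIComponent protects them:

```javascript
// A component value that happens to contain reserved characters.
const part = "a/b?c";

// encodeURI leaves reserved characters alone, since in a full URI
// they are assumed to be real separators:
console.log(encodeURI(part));          // "a/b?c"

// encodeURIComponent encodes them, since inside a single component
// they are just data:
console.log(encodeURIComponent(part)); // "a%2Fb%3Fc"
```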

Form submission

When an HTML form is submitted, each form field is URL-encoded before it is sent. For historical reasons, the URL encoding used by forms does not conform to the latest standard; for example, the encoding used for a space is not %20 but the + sign. If the form is submitted with the POST method, we can see a Content-Type header in the HTTP request whose value is application/x-www-form-urlencoded.

Most applications can handle this non-standard URL encoding, but in client-side JavaScript there is no built-in function that decodes the + sign into a space; you can easily write your own conversion function. Also, for non-ASCII characters, the character set used for encoding depends on the character set used by the current document.
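Since client-side JavaScript has no built-in decoder for the + convention, a minimal helper (a sketch; the function name is my own) looks like this:

```javascript
// decodeURIComponent does not turn "+" back into a space, so form
// data (application/x-www-form-urlencoded) needs one extra step:
// replace "+" with a space first, then percent-decode.
function decodeFormValue(s) {
  return decodeURIComponent(s.replace(/\+/g, " "));
}

console.log(decodeFormValue("a+b%26c"));  // "a b&c"
console.log(decodeURIComponent("a+b"));   // "a+b" (the "+" survives)
```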

For example, suppose we add this HTML header:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

The browser will then use GB2312 to render the document. (Note that when no meta tag is set in the HTML document, the browser chooses a character set automatically according to the current user's preferences, and the user can also force the current site to use a specified character set.)

When a form is submitted, the character set used for URL encoding is then GB2312.

When using Aptana earlier (why Aptana specifically will become clear below), I ran into a very confusing problem: the result of encodeURI was quite different from what I expected. Here is a sample of my demo code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
    </head>
    <body>
        <script type="text/javascript">
            document.write(encodeURI("中文"));
        </script>
    </body>
</html>

Executing it outputs %e6%b6%93%ee%85%9f%e6%9e%83, which is obviously not the result of URL-encoding with the UTF-8 character set (search for "中文" on Google and %E4%B8%AD%E6%96%87 is what appears in the URL).

At first I suspected that encodeURI was also affected by the page encoding, but I found that under normal circumstances, even using GB2312 as the page encoding, you would not get this result. I eventually discovered that the problem was caused by an inconsistency between the character set used to store the page file and the character set specified in the meta tag.

The Aptana editor uses the UTF-8 character set by default, which means the file was actually stored in UTF-8. But because GB2312 was specified in the meta tag, the browser parsed the document as GB2312, so the string "中文" naturally came out wrong: encoded in UTF-8, "中文" yields the bytes 0xE4 0xB8 0xAD 0xE6 0x96 0x87, and when the browser decodes those 6 bytes as GB2312, it gets three other characters instead (in GBK, one Chinese character occupies two bytes). Passing those three characters to encodeURI produces %e6%b6%93%ee%85%9f%e6%9e%83. So encodeURI still uses UTF-8 and is not affected by the page's character set.

Different browsers behave differently when handling URLs that include Chinese. For example, in IE, if you check the advanced setting "Always send URLs as UTF-8", the Chinese in the path portion of the URL is URL-encoded with UTF-8 and sent to the server, while the Chinese in the query string is URL-encoded with the system default character set. To ensure maximum interoperability, it is recommended that all components placed in a URL be explicitly URL-encoded with a specified character set, rather than relying on the browser's default behavior.
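One way to follow this advice is to always assemble query strings yourself with encodeURIComponent, which is specified to use UTF-8; the buildQuery helper below is a hypothetical sketch:

```javascript
// Build a query string by explicitly encoding every key and value
// with encodeURIComponent (always UTF-8), instead of relying on the
// browser's default character set.
function buildQuery(params) {
  return Object.entries(params)
    .map(([k, v]) => encodeURIComponent(k) + "=" + encodeURIComponent(v))
    .join("&");
}

console.log("/search?" + buildQuery({ q: "中文", ie: "utf-8" }));
// /search?q=%E4%B8%AD%E6%96%87&ie=utf-8
```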

In addition, many HTTP monitoring tools, as well as the browser address bar, automatically decode the URL (using the UTF-8 character set) when displaying it. That is why, when you search for Chinese on Google in Firefox, the URL shown in the address bar includes the Chinese characters, while the original URL actually sent to the server is still encoded.

You can verify this by evaluating location.href with JavaScript in the address bar.

Don't be fooled by these illusions when researching URL encoding and decoding.
