A detailed description of URL encoding and decoding in JavaScript

Source: Internet
Author: User
Article Directory
    • Non-printable characters in the US-ASCII character set
    • Reserved characters
    • Unsafe characters
    • Different safe characters
    • Different compatibility
    • Unicode characters are encoded in different ways
    • Different application scenarios
Summary

This article covers URI encoding and decoding. It describes in detail which characters need to be encoded and why, and then compares the JavaScript encoding and decoding functions escape/unescape, encodeURI/decodeURI, and encodeURIComponent/decodeURIComponent.

Prerequisites

  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment

URI stands for Uniform Resource Identifier; a URL is just one kind of URI. The format of a typical URL is shown above. The "URL encoding" discussed below really means URI encoding.
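
The five components can also be pulled apart programmatically. The following is a minimal sketch, assuming an environment that provides the WHATWG URL class (modern browsers and Node.js); the variable names example and parts are only illustrative.

```javascript
// Parse the example URI and map the URL object's properties onto the
// five generic components named in the diagram.
const example = new URL('foo://example.com:8042/over/there?name=ferret#nose');

const parts = {
  scheme: example.protocol.replace(/:$/, ''), // strip the trailing ":"
  authority: example.host,                    // host plus port
  path: example.pathname,
  query: example.search.replace(/^\?/, ''),   // strip the leading "?"
  fragment: example.hash.replace(/^#/, ''),   // strip the leading "#"
};

console.log(parts);
```

Note that the URL class keeps the delimiters (":", "?", "#") on some properties, which is why the sketch strips them.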

Why URL encoding?

Something usually needs to be encoded because it is not suitable for transmission as it is. There can be many reasons, such as being too large or containing private data. For URLs, encoding is needed because some characters would otherwise make the URL ambiguous.

For example, the URL query string passes parameters as key=value pairs, and the pairs are separated by the & symbol, as in /s?q=abc&ie=utf-8. If a value itself contains = or &, the server that receives the URL will inevitably parse it incorrectly. The ambiguous & and = symbols must therefore be escaped, that is, encoded.
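
The ambiguity is easy to reproduce. In the sketch below, the value a&b=c and the helper parseQuery are made up for illustration; URLSearchParams is used only as a stand-in for whatever parser the receiving server runs.

```javascript
// A value that happens to contain the delimiters & and =.
const rawValue = 'a&b=c';

// Naive concatenation: the value's own & and = look like delimiters.
const naive = '/s?q=' + rawValue + '&ie=utf-8';

// Encoding the value first keeps it in one piece.
const safe = '/s?q=' + encodeURIComponent(rawValue) + '&ie=utf-8';

// Hypothetical helper: parse the query part into a plain object.
const parseQuery = (url) =>
  Object.fromEntries(new URLSearchParams(url.split('?')[1]));

console.log(naive, parseQuery(naive)); // q is cut short and a stray "b" key appears
console.log(safe, parseQuery(safe));   // q comes back intact
```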

As another example, URLs are written in ASCII rather than Unicode, which means a URL cannot directly contain non-ASCII characters such as Chinese characters. Otherwise, if the client browser and the server support different character sets, the Chinese characters may cause problems.

The principle of URL encoding is to use safe characters (printable characters with no special purpose or meaning) to represent unsafe characters.

Characters to be encoded

RFC 3986 stipulates that a URL may contain only letters (A-Z, a-z), digits (0-9), the four special characters - _ . ~, and the reserved characters.

RFC 3986 gives detailed recommendations on URL encoding and decoding. It points out which characters must be encoded so that the meaning of the URL does not change, and explains why.

Non-printable characters in the US-ASCII character set

Only printable characters are allowed in a URL. Bytes 0x00-0x1F and 0x7F in US-ASCII are control characters, which cannot appear directly in a URL. Bytes 0x80-0xFF (ISO-8859-1) are not allowed either, because they fall outside the byte range defined by US-ASCII.

Reserved characters

A URL can be divided into several components, such as the protocol, host, and path. Some characters (: / ? # [ ] @) are used to separate the components. For example, the colon separates the protocol from the host, / separates the host from the path, and ? separates the path from the query parameters. Other characters (! $ & ' ( ) * + , ; =) are used as delimiters inside a component. For example, = joins the key and value of a query parameter, and & separates multiple key-value pairs. When ordinary data in a component contains any of these special characters, it must be encoded.

RFC 3986 specifies the following characters as reserved characters:

! * ' ( ) ; : @ & = + $ , / ? # [ ]

Unsafe characters

There are also some characters that may confuse the parsing program when they are placed directly in a URL. These characters are considered unsafe for a variety of reasons.

Character            Why it is unsafe
Space                Spaces may be introduced or dropped when a URL is transmitted, typeset, or handled by a text-processing program.
Quotation marks, <>  Quotation marks and angle brackets are commonly used to delimit URLs in ordinary text.
#                    Used to mark a bookmark or anchor.
%                    The percent sign is itself the special character used to encode unsafe characters, so it must be encoded.
{ } | \ ^ [ ] ` ~    Some gateways or transport agents modify these characters.

Note that for characters that are legal in a URL, the encoded and unencoded forms are equivalent. For the characters described above, however, leaving them unencoded may change the meaning of the URL. Therefore, only ordinary English letters and digits, the special characters $ - _ . + ! * ' ( ), and the reserved characters may appear unencoded in a URL. All other characters must be encoded before they can appear in a URL.

However, for historical reasons there are still some nonstandard encoding implementations. Take the ~ symbol: although RFC 3986 states that ~ does not need to be URL encoded, many old gateways or transport agents still encode it.

How to encode invalid characters in a URL

URL encoding is also known as percent-encoding because the scheme is very simple: a percent sign (%) followed by two hexadecimal digits (0123456789ABCDEF) represents one byte. The default character set for URL encoding is US-ASCII. For example, the byte for a in US-ASCII is 0x61, so a is URL encoded as %61. Entering http://g.cn/search?q=%61%62%63 in the address bar is therefore equivalent to searching for abc on Google. Likewise, the byte for the @ symbol in ASCII is 0x40, so it is URL encoded as %40.
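
The byte-to-%XX rule can be sketched in a few lines. The helper name percentEncodeByte is hypothetical, and the sketch deliberately encodes every byte, including ones that would not need it, to make the mapping visible.

```javascript
// Turn one byte (0-255) into "%" plus two uppercase hexadecimal digits.
function percentEncodeByte(byte) {
  return '%' + byte.toString(16).toUpperCase().padStart(2, '0');
}

// ASCII characters are one byte each, so charCodeAt(0) is the byte value.
const encodedAbc = [...'abc']
  .map((ch) => percentEncodeByte(ch.charCodeAt(0)))
  .join('');

console.log(encodedAbc);                     // the fully encoded form of "abc"
console.log(decodeURIComponent(encodedAbc)); // decodes back to "abc"
console.log(percentEncodeByte(0x40));        // the encoding of "@"
```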

URL encoding of common characters:

Character   !    *    "    '    (    )    ;    :    @    &
Encoding    %21  %2A  %22  %27  %28  %29  %3B  %3A  %40  %26

Character   =    +    $    ,    /    ?    %    #    [    ]
Encoding    %3D  %2B  %24  %2C  %2F  %3F  %25  %23  %5B  %5D

Non-ASCII characters must first be converted to bytes using a character set that is a superset of ASCII, and then each byte is percent-encoded. For Unicode characters, the recommendation is to encode them as UTF-8 and then percent-encode each byte. For example, "中文" ("Chinese") in UTF-8 is the byte sequence 0xE4 0xB8 0xAD 0xE6 0x96 0x87, so its URL encoding is "%E4%B8%AD%E6%96%87".
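
The two steps (text to UTF-8 bytes, then bytes to %XX) can be sketched as follows for the two-character string 中文 ("Chinese"). The helper name percentEncodeUtf8 is hypothetical, and TextEncoder is assumed to be available (modern browsers and Node.js). Unlike the built-in functions, this sketch encodes every byte.

```javascript
// Step 1: convert the text to UTF-8 bytes. Step 2: percent-encode each byte.
function percentEncodeUtf8(text) {
  const bytes = new TextEncoder().encode(text); // Uint8Array of UTF-8 bytes
  return Array.from(
    bytes,
    (b) => '%' + b.toString(16).toUpperCase().padStart(2, '0')
  ).join('');
}

const encodedChinese = percentEncodeUtf8('\u4E2D\u6587'); // the string "中文"

console.log(encodedChinese);
// The built-in function produces the same result for this input.
console.log(encodedChinese === encodeURIComponent('\u4E2D\u6587'));
```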

If a byte corresponds to a non-reserved character in the ASCII character set, this Byte does not need to be expressed by a percent sign.. For example, "url encoding", the bytes encoded using the UTF-8 are 0x55 0x72 0x6c 0xe7 0xbc 0x96 0xe7 0xa0 0x81, because the first three bytes correspond to the non-reserved character "url" in ASCII, these three bytes can be expressed by non-reserved character "url. The final URL encoding can be simplified to "url % E7 % BC % 96% E7 % A0 % 81". Of course, you can also use "% 55% 72% 6C % E7 % BC % 96% E7 % A0 % 81.
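
A quick check that the short and the fully encoded forms are interchangeable, using the mixed string "Url" followed by the two Chinese characters U+7F16 and U+7801 (the variable names are illustrative only):

```javascript
const mixed = 'Url\u7F16\u7801';

// The built-in encoder leaves the unreserved ASCII letters alone.
const minimalForm = encodeURIComponent(mixed);

// The same bytes with every one of them percent-encoded.
const fullForm = '%55%72%6C%E7%BC%96%E7%A0%81';

console.log(minimalForm);
console.log(decodeURIComponent(minimalForm) === decodeURIComponent(fullForm));
```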

Due to historical reasons, some URL encoding implementations do not fully follow this principle, which will be mentioned below.

Differences between escape, encodeURI, and encodeURIComponent in JavaScript

JavaScript provides three pairs of functions for URL encoding: escape/unescape, encodeURI/decodeURI, and encodeURIComponent/decodeURIComponent. Since decoding is simply the reverse of encoding, only the encoding process is described here.

The three encoding functions (escape, encodeURI, and encodeURIComponent) all convert unsafe or illegal URL characters into legal URL characters. They differ in the following ways.

Different safe characters

The following table lists the safe characters of these three functions (that is, the function does not encode these characters)

Function                  Safe characters
escape (69)               * / @ + - . _ 0-9 a-z A-Z
encodeURI (82)            ! # $ & ' ( ) * + , / : ; = ? @ - . _ ~ 0-9 a-z A-Z
encodeURIComponent (71)   ! ' ( ) * - . _ ~ 0-9 a-z A-Z

Different compatibility

The escape function has existed since JavaScript 1.0. The other two functions were introduced in JavaScript 1.5. Because JavaScript 1.5 is already very widespread, using encodeURI and encodeURIComponent does not cause any compatibility problems.

Unicode characters are encoded in different ways.

All three functions encode ASCII characters the same way: a percent sign followed by two hexadecimal digits. For Unicode characters, however, escape produces %uXXXX, where XXXX is four hexadecimal digits representing the Unicode character. This method has been abandoned by the W3C, although the syntax is still retained in the ECMA-262 standard for escape. encodeURI and encodeURIComponent first encode non-ASCII characters as UTF-8 and then percent-encode the bytes, which is what the RFC recommends. It is therefore advisable to use these two functions instead of escape whenever possible.
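
The difference is easy to see side by side. This sketch assumes the legacy escape function is still exposed as a global, which is the case in browsers and Node.js; the sample string is the two characters U+4E2D U+6587.

```javascript
const sample = '\u4E2D\u6587';

// Legacy form: one %uXXXX group per UTF-16 code unit.
const viaEscape = escape(sample);

// Recommended form: UTF-8 bytes, each one percent-encoded.
const viaComponent = encodeURIComponent(sample);

console.log(viaEscape);
console.log(viaComponent);

// For plain ASCII input, the functions agree.
console.log(escape('a b') === encodeURIComponent('a b'));
```

A server that expects standard percent-encoded UTF-8 will generally not understand the %uXXXX form, which is one practical reason to avoid escape.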

Different application scenarios

encodeURI is used to encode a complete URI, while encodeURIComponent is used to encode a single component of a URI.

The safe character table above shows that encodeURIComponent encodes a larger range of characters than encodeURI. As mentioned earlier, reserved characters are generally used to separate URI components (a URI can be split into several components; see the Prerequisites section) or sub-components (such as the delimiters between query parameters). For example, : separates the scheme from the host, and ? separates the path from the query. Because encodeURI operates on a complete URI, in which these characters serve a special purpose, it does not encode the reserved characters; otherwise the meaning of the URI would change.

A component has its own internal data format, but its data must not contain the reserved characters that separate components; otherwise the division of the whole URI into components breaks down. That is why encodeURIComponent, which is meant for a single component, has to encode more characters.
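
The sketch below contrasts the two functions on made-up inputs (the example.com URLs and the value ferret&co=1 are illustrative only). It shows why a whole URI goes through encodeURI while an individual value goes through encodeURIComponent.

```javascript
const fullUrl = 'http://example.com/a b?name=ferret&lang=en#nose';

// encodeURI keeps the delimiters and only fixes the illegal space.
const wholeEncoded = encodeURI(fullUrl);

// encodeURIComponent treats the entire string as data, so the
// delimiters are encoded too and the result is no longer a usable URL.
const overEncoded = encodeURIComponent(fullUrl);

// Typical use: encode each value separately, then assemble the URL.
const value = 'ferret&co=1';
const built = 'http://example.com/over/there?name=' + encodeURIComponent(value);

console.log(wholeEncoded);
console.log(overEncoded);
console.log(built);
```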

Form submission

When an HTML form is submitted, each form field is URL encoded before it is sent. For historical reasons, the URL encoding used by forms does not follow the latest standard. For example, a space is encoded as "+" rather than "%20". If the form is submitted with the POST method, the HTTP request carries a Content-Type header whose value is application/x-www-form-urlencoded. Most applications can handle this nonstandard form of URL encoding. In client-side JavaScript, however, none of the decoding functions described above turns "+" back into a space, so you have to write that conversion yourself. In addition, for non-ASCII characters, the character set used for encoding depends on the character set of the current document. For example, suppose we add the following tag to the HTML head:

 
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

With this tag, the browser renders the document using gb2312. (Note: when the HTML document has no such meta tag, the browser selects a character set automatically according to the user's preferences, and the user can also force the browser to use a specific character set for the current site.) When the form is submitted, the character set used for URL encoding is then gb2312.
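
A conversion function for the "+" convention might look like the sketch below. The name decodeFormValue is hypothetical, and the sketch assumes the form data was encoded as UTF-8; it would not correctly decode bytes produced from a gb2312 page, since decodeURIComponent only understands UTF-8.

```javascript
// Decode one application/x-www-form-urlencoded value.
// "+" means space in this format, so convert it before percent-decoding;
// a literal plus sign arrives as %2B and is restored by decodeURIComponent.
function decodeFormValue(encoded) {
  return decodeURIComponent(encoded.replace(/\+/g, ' '));
}

console.log(decodeFormValue('hello+world'));
console.log(decodeFormValue('1%2B1+%3D+2'));
```

The order matters: replacing "+" after percent-decoding would also turn genuine plus signs into spaces.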

Does the document character set affect encodeURI?

I ran into a very confusing problem while using Aptana (why I mention the editor will become clear below): when I called encodeURI, the encoded result was very different from what I expected. Here is my example code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
    </head>
    <body>
        <script type="text/javascript">
            document.write(encodeURI("中文"));
        </script>
    </body>
</html>

The page outputs %E6%B6%93%EE%85%9F%E6%9E%83. This is obviously not the result of URL encoding "中文" with the UTF-8 character set (search for "中文" on Google and the URL shows %E4%B8%AD%E6%96%87).

At first I suspected that encodeURI depended on the page encoding after all, but URL encoding with gb2312 would not normally produce this result either. I finally found that the problem was caused by a mismatch between the character set in which the page file was stored and the character set specified in the meta tag. The Aptana editor uses UTF-8 by default, so the file was actually stored as UTF-8.

Because the meta tag specifies gb2312, however, the browser parses the document as gb2312, and the string "中文" is naturally misread. Its UTF-8 encoding is the six bytes 0xE4 0xB8 0xAD 0xE6 0x96 0x87. When the browser decodes those six bytes as gb2312, it gets three entirely different characters, displayed as something like "涓枃" (one Chinese character occupies two bytes in GBK). Passing those three characters to encodeURI yields %E6%B6%93%EE%85%9F%E6%9E%83. The conclusion is that encodeURI always uses UTF-8 and is not affected by the page character set.
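
One way to confirm this diagnosis is to decode the unexpected output and look at the code points it contains. The sketch below is only an illustration and the variable names are made up; it writes the test string with \u escapes so that the result does not depend on how the script file itself is stored.

```javascript
// The output observed on the mis-declared page.
const observed = '%E6%B6%93%EE%85%9F%E6%9E%83';

// Decode it as UTF-8 and list the code points of the resulting characters.
const codePoints = Array.from(
  decodeURIComponent(observed),
  (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase()
);
console.log(codePoints); // three characters, none of them U+4E2D or U+6587

// With the intended characters, encodeURI produces the expected UTF-8 form.
const expected = encodeURI('\u4E2D\u6587');
console.log(expected);
```

The output is valid UTF-8 for three different characters, which matches the explanation that encodeURI did its job correctly on a string that had already been misread.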

Other issues related to URL encoding

Different browsers handle URLs containing Chinese characters differently. In IE, for example, if the advanced setting "always send URL with UTF-8" is checked, the Chinese part of the path is URL encoded with UTF-8 before being sent to the server, while the Chinese part of the query parameters is URL encoded with the system default character set. For maximum interoperability, it is recommended that every component placed in a URL be URL encoded explicitly with a specified character set, rather than relying on the browser's default behavior.

In addition, many HTTP monitoring tools and browser address bars automatically decode the URL once (using the UTF-8 character set) when displaying it. This is why the address bar shows Chinese characters when you search Google for Chinese text in Firefox. The original URL actually sent to the server is still encoded; you can see it by reading location.href with JavaScript. Do not be misled by this display behavior when studying URL encoding and decoding.
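
Outside a browser page there is no location object, but the WHATWG URL class gives a rough stand-in for the same behavior. The sketch below assumes that class is available (browsers and Node.js) and uses a made-up example.com address.

```javascript
// What a user might see or type in the address bar.
const displayed = 'http://example.com/search?q=\u4E2D\u6587';

// The serialized form, comparable to what location.href would report.
const serialized = new URL(displayed).href;
console.log(serialized);

// Decoding once for display gives back the readable form.
console.log(decodeURI(serialized) === displayed);
```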
