Reprint: Why URIs Need to Be Encoded

Last Update:2015-07-12 Source: Internet

Author: User

Tags: control characters, printable characters

Why URL encoding is required

When something needs to be encoded, it usually means that it is not suitable for transmission in its raw form. The reasons vary: the data may be too large, or it may contain private information. For URLs, the reason is that some characters make the URL ambiguous. For example, a URL query string passes parameters as key=value pairs separated by the & symbol, as in /s?q=abc&ie=utf-8. If a value contains = or &, the server that receives the URL will parse it incorrectly, so the ambiguous & and = characters must be escaped, that is, encoded.

A second reason is that URLs are written in ASCII rather than Unicode, so a URL cannot directly contain non-ASCII characters such as Chinese. If it did, those characters could be misinterpreted whenever the client browser and the server support different character sets.
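The ambiguity described above can be sketched in JavaScript. The value a&b=c is made-up data for illustration; URLSearchParams is the standard query-string parser available in browsers and Node.js.

```javascript
// A value that itself contains the delimiters "&" and "=" (hypothetical data).
const value = "a&b=c";

// Naive concatenation: the parser sees two parameters, "q" and "b".
const naive = "q=" + value + "&ie=utf-8";
const naiveParsed = new URLSearchParams(naive);

// Encoding the value first keeps it intact as a single parameter.
const encoded = "q=" + encodeURIComponent(value) + "&ie=utf-8";
const encodedParsed = new URLSearchParams(encoded);

console.log(naiveParsed.get("q"));   // "a"
console.log(naiveParsed.get("b"));   // "c"
console.log(encoded);                // "q=a%26b%3Dc&ie=utf-8"
console.log(encodedParsed.get("q")); // "a&b=c"
```

Without encoding, the original value is split and part of it turns into an unintended parameter; with encoding, it survives the round trip.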

The principle of URL encoding is to represent unsafe characters using safe characters, that is, printable characters with no special purpose or meaning.

Which characters need to be encoded

RFC 3986 stipulates that a URL may contain only the English letters (a-z, A-Z), the digits (0-9), the four special characters - _ . ~, and the reserved characters. The RFC gives detailed guidance on encoding and decoding URLs: it specifies which characters must be encoded so that the semantics of the URL do not change, and explains why.

Characters with no printable representation in US-ASCII

Only printable characters may be used in a URL. In US-ASCII, the bytes 00-1F and 7F are control characters and cannot appear directly in a URL. The bytes 80-FF (as used by ISO-8859-1, for example) fall outside the range defined by US-ASCII and therefore cannot be placed in a URL either.
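As a rough illustration of the principle, the sketch below percent-encodes every byte that is not a letter, digit, or one of - _ . ~. The helper name percentEncode is hypothetical, and it assumes the text is first converted to UTF-8 bytes; it also encodes reserved characters, so it is suited to data inside a component rather than to a whole URL.

```javascript
// percentEncode is a hypothetical helper, not a built-in function.
// It keeps only letters, digits, and - _ . ~ and encodes everything else.
function percentEncode(text) {
  const unreserved = /[A-Za-z0-9\-_.~]/;
  let out = "";
  // TextEncoder always produces UTF-8 bytes (an assumption of this sketch).
  for (const byte of new TextEncoder().encode(text)) {
    const ch = String.fromCharCode(byte);
    if (byte < 0x80 && unreserved.test(ch)) {
      out += ch; // safe character: emit as-is
    } else {
      // unsafe byte: "%" followed by two uppercase hexadecimal digits
      out += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

console.log(percentEncode("a b"));   // "a%20b"
console.log(percentEncode("中"));    // "%E4%B8%AD"
console.log(percentEncode("x=1&y")); // "x%3D1%26y"
```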

Reserved characters

A URL can be divided into several components: protocol (scheme), host, path, and so on. Some characters (: / ? # [ ] @) are used to separate these components. For example, the colon separates the protocol from the host, / separates the host from the path, and ? separates the path from the query parameters. Other characters (! $ & ' ( ) * + , ; =) act as delimiters within a component: = joins a key to its value in the query parameters, and & separates one key-value pair from the next. When ordinary data inside a component contains any of these characters, they must be encoded.
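To see how the reserved characters split a URL into components, here is a small sketch using the standard URL class (available in browsers and Node.js). The address itself is a made-up example.

```javascript
// Hypothetical URL used only for illustration.
const url = new URL("http://example.com/s?q=abc&ie=utf-8#top");

console.log(url.protocol); // "http:"
console.log(url.host);     // "example.com"
console.log(url.pathname); // "/s"
console.log(url.search);   // "?q=abc&ie=utf-8"
console.log(url.hash);     // "#top"

// Inside the query component, "&" and "=" delimit the key-value pairs.
const pairs = [...url.searchParams.entries()];
console.log(pairs); // [["q", "abc"], ["ie", "utf-8"]]
```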

The following characters are reserved in RFC 3986:

! * ' ( ) ; : @ & = + $ , / ? # [ ]

Unsafe characters

Some characters can confuse a parser when they are placed directly in a URL. They are considered unsafe for a number of reasons:

Space: insignificant spaces may be introduced, or significant spaces removed, while a URL is transmitted, typeset by a user, or processed by a text-handling program.
Quotation marks and < >: quotation marks and angle brackets are commonly used to delimit URLs in ordinary text.
#: typically used to mark a bookmark or an anchor.
%: the percent sign is itself the special character used to encode unsafe characters, so it must be encoded too.
{ } | \ ^ [ ] ` ~: some gateways or transport agents modify these characters.
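For reference, this short sketch shows how several of the unsafe characters above look once percent-encoded. It uses the built-in encodeURIComponent; the selection of characters is only a sample.

```javascript
// A sample of unsafe characters: space, quote, angle brackets, percent,
// braces, pipe, backslash, and caret.
const unsafe = [" ", '"', "<", ">", "%", "{", "}", "|", "\\", "^"];

// Pair each character with its percent-encoded form.
const table = unsafe.map((ch) => [ch, encodeURIComponent(ch)]);

for (const [ch, enc] of table) {
  console.log(JSON.stringify(ch), "->", enc);
}
// The encoded forms, in order:
// %20 %22 %3C %3E %25 %7B %7D %7C %5C %5E
```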

Note that for characters that are legal in a URL, the encoded and unencoded forms are equivalent, but for the characters mentioned above, leaving them unencoded may change the meaning of the URL. Therefore, only ordinary English letters and digits, the special characters $ - _ . + ! * ' ( ), and reserved characters used for their reserved purpose may appear in a URL without encoding. All other characters must be encoded before they appear in a URL.

JavaScript provides three pairs of functions for encoding URLs into a valid form: escape/unescape, encodeURI/decodeURI, and encodeURIComponent/decodeURIComponent. Because decoding is simply the reverse of encoding, only encoding is explained here. The three encoding functions (escape, encodeURI, and encodeURIComponent) all convert unsafe or illegal URL characters into legal ones, but they differ in several ways.

Safe characters differ

The following table lists the safe characters for the three functions, that is, the characters each function does not encode:

Function                  Safe characters
escape (69)               */@+-._0-9a-zA-Z
encodeURI (82)            !#$&'()*+,/:;=?@-._~0-9a-zA-Z
encodeURIComponent (71)   !'()*-._~0-9a-zA-Z

Compatibility differs

The escape function has existed since JavaScript 1.0, while the other two were introduced in JavaScript 1.5. Because JavaScript 1.5 is already ubiquitous, encodeURI and encodeURIComponent pose no compatibility problems in practice.
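The safe-character counts given above (69, 82, and 71) can be checked with a short sketch that feeds every ASCII character to each function and keeps the ones that come back unchanged. The helper name safeChars is hypothetical.

```javascript
// safeChars is a hypothetical helper: it returns the ASCII characters
// (0x00-0x7F) that the given encoding function leaves unchanged.
function safeChars(encode) {
  let safe = "";
  for (let code = 0; code < 128; code++) {
    const ch = String.fromCharCode(code);
    if (encode(ch) === ch) safe += ch;
  }
  return safe;
}

// escape is a legacy global function; it is used here only for comparison.
console.log(safeChars(escape).length);             // 69
console.log(safeChars(encodeURI).length);          // 82
console.log(safeChars(encodeURIComponent).length); // 71

// Characters that encodeURI keeps but encodeURIComponent encodes.
const onlyEncodeURI = [...safeChars(encodeURI)]
  .filter((ch) => encodeURIComponent(ch) !== ch)
  .join("");
console.log(onlyEncodeURI); // "#$&+,/:;=?@"
```

The last line lists the 11 characters that account for the difference between encodeURI and encodeURIComponent.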

Unicode characters are encoded differently

All three functions encode ASCII characters the same way: a percent sign followed by two hexadecimal digits. For Unicode characters above U+00FF, however, escape produces %uXXXX, where XXXX is the four-digit hexadecimal code of the character. This form has since been abandoned, although the ECMA-262 standard still retains it for escape. encodeURI and encodeURIComponent instead encode non-ASCII characters as UTF-8 and then percent-encode each byte, which is what the RFC recommends. It is therefore advisable to use these two functions rather than escape whenever possible.
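A quick comparison on a single Chinese character makes the difference concrete. The character is an arbitrary example; escape is called only to show its legacy output.

```javascript
const ch = "中"; // U+4E2D, an arbitrary non-ASCII example

// escape uses the non-standard %uXXXX form.
const legacy = escape(ch);
console.log(legacy); // "%u4E2D"

// encodeURI and encodeURIComponent percent-encode the UTF-8 bytes.
const viaEncodeURI = encodeURI(ch);
const viaComponent = encodeURIComponent(ch);
console.log(viaEncodeURI); // "%E4%B8%AD"
console.log(viaComponent); // "%E4%B8%AD"

// For an ASCII character such as a space, all three agree.
const spaces = [escape(" "), encodeURI(" "), encodeURIComponent(" ")];
console.log(spaces.join(" ")); // "%20 %20 %20"

// Decoding reverses the UTF-8 form.
console.log(decodeURIComponent(viaComponent) === ch); // true
```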

Use cases differ

encodeURI is used to encode a complete URI, while encodeURIComponent is used to encode a single component of a URI. The safe-character lists above show that encodeURIComponent encodes a larger set of characters than encodeURI. As mentioned earlier, reserved characters separate the components of a URI (a URI can be split into several components) or the subcomponents within one (such as the delimiters between query parameters): for example, : separates the scheme from the host, and / separates the host from the path. Because encodeURI operates on a complete URI, in which these characters carry their special meaning, it leaves them unencoded; encoding them would change the meaning of the URI.

A component has its own internal data format, but the data inside it must not contain the reserved characters that delimit components, or the division of the whole URI into components would become ambiguous. encodeURIComponent, which is applied to a single component, therefore has to encode more characters.
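The sketch below contrasts the two functions on a complete URI and on a single query value. All addresses and parameter names (example.com, next, tab, lang) are hypothetical.

```javascript
// encodeURI on a complete URI: delimiters survive, only the spaces change.
const whole = encodeURI("http://example.com/my path/?q=a b");
console.log(whole); // "http://example.com/my%20path/?q=a%20b"

// encodeURIComponent on a complete URI: the delimiters are encoded too,
// so the result is no longer usable as a URI on its own.
const mangled = encodeURIComponent("http://example.com/s?q=abc");
console.log(mangled); // "http%3A%2F%2Fexample.com%2Fs%3Fq%3Dabc"

// encodeURIComponent on one component: a value that contains ? & = /
// can be embedded safely as a single query parameter.
const target = "/home?tab=1&lang=en";
const link = "http://example.com/login?next=" + encodeURIComponent(target);
console.log(link);
// "http://example.com/login?next=%2Fhome%3Ftab%3D1%26lang%3Den"

// The receiving side recovers the original value intact.
console.log(new URL(link).searchParams.get("next") === target); // true
```

A reasonable rule of thumb that follows from this: apply encodeURIComponent to each piece of data before assembling the URI, and reserve encodeURI for a URI that is already assembled.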

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
