Special character escaping and encoding in URLs

Source: Internet
Author: User
Tags: control characters, HTML header, printable characters, RFC

December 29, 2017

Character → URL-encoded value

; → %3b
? → %3f
| → %7c

Some characters have a special meaning inside a URL, so they must be escaped. The basic encoding rules are as follows:

1. A space is replaced with a plus sign (+)
2. The forward slash (/) separates directories and subdirectories
3. The question mark (?) separates the URL from the query string
4. The percent sign (%) introduces an escaped special character
5. The hash sign (#) specifies a bookmark (anchor)
6. The ampersand (&) separates query parameters

If one of these characters needs to appear literally in a URL, it must be replaced with its percent-escaped hexadecimal value. For example, ? becomes %3f.




This article introduces URI encoding and decoding: which characters in a URL need to be encoded, why they need to be encoded, and how JavaScript's three pairs of codec functions compare: escape/unescape, encodeURI/decodeURI, and encodeURIComponent/decodeURIComponent.


   foo://example.com:8042/over/there?name=ferret#nose
   \_/   \______________/\_________/ \_________/ \__/
    |           |             |           |        |
 scheme     authority        path       query   fragment

A URI is a Uniform Resource Identifier; what we usually call a URL is just one kind of URI. The format of a typical URI is shown above. The "URL encoding" discussed below actually refers to URI encoding.

Why URL encoding is required

Usually, if something needs to be encoded, it means it is not suitable for transmission as-is. The reasons vary: the data may be too large, or it may contain private information. For URLs, the reason is that some characters would make the URL ambiguous.

For example, a URL parameter string uses key=value pairs separated by the & symbol, as in /s?q=abc&ie=utf-8. If a value itself contains = or &, the server is bound to misparse the URL it receives, so the ambiguous & and = symbols must be escaped, that is, encoded.
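The ambiguity above can be shown in a few lines of JavaScript (the parameter name q and the value are hypothetical):

```javascript
// A query value that itself contains '=' and '&'.
const value = 'a=b&c';

// Unencoded, the server cannot tell the data apart from the delimiters:
const ambiguous = '/s?q=' + value; // "/s?q=a=b&c" — looks like two parameters

// Percent-encoding the value removes the ambiguity:
const safe = '/s?q=' + encodeURIComponent(value);
console.log(safe); // "/s?q=a%3Db%26c"
```

The server can now safely split on & and = first, and decode each piece afterwards.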

Another example: the URL is encoded in ASCII rather than Unicode, which means a URL cannot contain any non-ASCII characters, such as Chinese. Otherwise, Chinese characters can cause problems if the client browser and the server support different character sets.

The principle of URL encoding is to use safe characters (printable characters with no special meaning) to represent unsafe ones.

Which characters need to be encoded

RFC 3986 stipulates that only letters (a-z, A-Z), digits (0-9), the four special characters - _ . ~, and the reserved characters are allowed to appear unencoded in a URL.

The RFC 3986 document makes detailed recommendations on URL encoding and decoding, indicating which characters need to be encoded so that the URL's semantics do not change, and explaining why.

Non-printable characters in the US-ASCII character set

Only printable characters are allowed in a URL. The bytes 0x00-0x1F and 0x7F in US-ASCII are control characters and must not appear directly in a URL. The bytes 0x80-0xFF (e.g. in ISO-8859-1) fall outside the range defined by US-ASCII and therefore cannot be placed in a URL either.

Reserved characters

A URL is divided into several components: scheme, host, path, and so on. Some characters (: / ? # [ ] @) separate the different components: the colon separates the scheme from the host, / separates the host from the path, ? separates the path from the query parameters, and so on. Other characters (! $ & ' ( ) * + , ; =) delimit data within each component: = represents key-value pairs in query parameters, and & separates multiple key-value pairs. When ordinary data inside a component contains these special characters, it must be encoded.

The following characters are reserved characters in RFC3986:

! * ( ) ; : @ & = + $ , / ? # [ ]

Unsafe characters

There are also some characters that may cause ambiguity in the parser when they are placed directly in the URL. These characters are considered unsafe characters for a number of reasons.

- Space: during transmission, during typesetting by the user, or during processing by text handlers, insignificant spaces may be introduced or meaningful spaces removed
- " and < >: quotation marks and angle brackets are commonly used to delimit URLs in plain text
- #: typically used to mark a bookmark or anchor
- %: the percent sign itself is used to encode unsafe characters, so it must itself be encoded
- { } | \ ^ [ ] ` ~: some gateways or transport agents tamper with these characters

Note that for legal characters in a URL, the encoded and unencoded forms are equivalent, but for the characters listed above, leaving them unencoded may change the URL's semantics. Therefore, only ordinary letters and digits, the special characters $ - _ . + ! * ' ( ), and the reserved characters may appear unencoded in a URL; all other characters must be encoded before they appear in the URL.

However, for historical reasons, some non-standard encoding implementations still exist. For example, although RFC 3986 stipulates that the tilde (~) does not require URL encoding, many old gateways or transport agents will encode it anyway.

How to encode illegal characters in a URL

URL encoding is also commonly called percent-encoding because its scheme is very simple: a percent sign followed by two hexadecimal digits (0-9, a-f) representing one byte. The default character set used by URL encoding is US-ASCII. For example, the letter a corresponds to the byte 0x61 in US-ASCII, so its URL encoding is %61; entering http://g.cn/search?q=%61%62%63 in the address bar is equivalent to searching for abc on Google. Similarly, the @ symbol corresponds to the byte 0x40 in ASCII, so it becomes %40 after URL encoding.

List of URL encodings for common characters:

URL encoding of reserved characters:

!    *    "    '    (    )    ;    :    @    &
%21  %2a  %22  %27  %28  %29  %3b  %3a  %40  %26

=    +    $    ,    /    ?    %    #    [    ]
%3d  %2b  %24  %2c  %2f  %3f  %25  %23  %5b  %5d

For non-ASCII characters, a superset of the ASCII character set must be used to obtain the corresponding bytes, and then percent-encoding is performed on each byte. For Unicode characters, the RFC recommends encoding with UTF-8 and then percent-encoding each byte. For example, the string "中文" encoded in UTF-8 yields the bytes 0xE4 0xB8 0xAD 0xE6 0x96 0x87, so after URL encoding it becomes "%e4%b8%ad%e6%96%87".

If a byte corresponds to a non-reserved character in the ASCII character set, it does not need the percent-sign representation. For example, the string "Url编码" encoded in UTF-8 yields the bytes 0x55 0x72 0x6C 0xE7 0xBC 0x96 0xE7 0xA0 0x81. Because the first three bytes correspond to the non-reserved characters "Url" in ASCII, those three bytes can be represented by the characters themselves. The final URL encoding can thus be simplified to "Url%e7%bc%96%e7%a0%81", although "%55%72%6c%e7%bc%96%e7%a0%81" is also valid.
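Both behaviors described above can be checked directly with encodeURIComponent, which follows the RFC's UTF-8-then-percent-encode recommendation:

```javascript
// Non-ASCII characters: each UTF-8 byte becomes %XX.
console.log(encodeURIComponent('中文'));    // "%E4%B8%AD%E6%96%87"

// Unreserved ASCII characters pass through unchanged:
console.log(encodeURIComponent('Url编码')); // "Url%E7%BC%96%E7%A0%81"
```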

For historical reasons, there are some URL encoding implementations that do not fully follow this principle, as mentioned below.

The difference between escape, encodeURI, and encodeURIComponent in JavaScript

JavaScript provides three pairs of functions for encoding URLs into valid form: escape/unescape, encodeURI/decodeURI, and encodeURIComponent/decodeURIComponent. Since decoding is simply the reverse of encoding, only the encoding process is explained here.

All three encoding functions, escape, encodeURI, and encodeURIComponent, convert unsafe or illegal URL characters into legal ones, but they differ in several ways.

Different security characters

The following table lists the safe characters for each function (that is, the characters the function does not encode), with the number of safe characters in parentheses:

Function (count): Safe characters
escape (69): * / @ + - . _ 0-9 a-z A-Z
encodeURI (82): ! # $ & ' ( ) * + , / : ; = ? @ - . _ ~ 0-9 a-z A-Z
encodeURIComponent (71): ! ' ( ) * - . _ ~ 0-9 a-z A-Z
Different compatibility

The escape function has existed since JavaScript 1.0; the other two functions were introduced in JavaScript 1.5. But since JavaScript 1.5 is now ubiquitous, there is no compatibility problem with encodeURI and encodeURIComponent in practice.

Unicode characters are encoded differently

All three functions encode ASCII characters the same way: a percent sign plus two hexadecimal digits. For Unicode characters, however, escape uses the %uxxxx form, where xxxx is four hexadecimal digits representing the character. This approach has been deprecated by the W3C, although the syntax is still preserved in the ECMA-262 standard. encodeURI and encodeURIComponent instead encode non-ASCII characters in UTF-8 before percent-encoding the bytes, which is what the RFC recommends. It is therefore advisable to use these two functions instead of escape whenever possible.
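The two Unicode treatments can be compared side by side on a single character:

```javascript
// escape uses the deprecated %uXXXX form (the character's code point),
// while encodeURIComponent percent-encodes the UTF-8 bytes (RFC 3986 style):
console.log(escape('中'));             // "%u4E2D"
console.log(encodeURIComponent('中')); // "%E4%B8%AD"
```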

Suitable for different occasions

encodeURI is used to encode a complete URI, while encodeURIComponent is used to encode a single component of a URI.

From the safe-character lists above, encodeURIComponent encodes a larger range of characters than encodeURI. As mentioned, reserved characters are generally used to separate URI components (a URI can be split into multiple components; see the earlier diagram) or subcomponents (such as the delimiters of query parameters): the colon separates the scheme from the host, and the slash separates the host from the path. Since encodeURI operates on a complete URI, these characters carry their special meanings there, so encodeURI leaves the reserved characters unencoded; encoding them would change the URI's meaning.

A component has its own internal data format, but that data must not contain the reserved characters that delimit components, or the separation of components in the whole URI becomes ambiguous. Therefore encodeURIComponent, which operates on a single component, needs to encode more characters.

Form submission

When an HTML form is submitted, each form field is URL-encoded before being sent. For historical reasons, the URL encoding used by forms does not conform to the latest standard: for example, spaces are encoded as + rather than %20. If the form is submitted with the POST method, the HTTP headers contain a Content-Type header with the value application/x-www-form-urlencoded. Most applications can handle this non-standard URL encoding, but client-side JavaScript has no built-in function that decodes + into a space, so you have to write the conversion yourself.

Also, for non-ASCII characters, the character set used for encoding depends on the character set of the current document. For example, suppose we add this HTML header:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

The browser will then use GB2312 to render the document. (When no META tag is present, the browser chooses a character set based on the user's preferences; the user can also force a specific character set for the current site.) When the form is submitted, the URL encoding uses GB2312.

Does the document character set affect encodeURI?

A while ago, while using Aptana (why Aptana specifically will become clear below), I ran into a very confusing problem: the result of encodeURI was very different from what I expected. Here is my sample code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
    </head>
    <body>
        <script type="text/javascript">
            document.write(encodeURI("中文"));
        </script>
    </body>
</html>

The output is %E6%B6%93%EE%85%9F%E6%9E%83. This is clearly not the result of URL encoding with the UTF-8 character set (search for "中文" on Google and the URL shows %E4%B8%AD%E6%96%87).

At first I suspected that encodeURI was affected by the page encoding after all, but I found that under normal circumstances, even URL encoding with GB2312 would not produce this result. Eventually I discovered the cause: the character set in which the page file was saved did not match the character set declared in the META tag. Aptana's editor uses UTF-8 by default, so the file was actually stored as UTF-8; but because the META tag declared GB2312, the browser parsed the document as GB2312. Naturally the string "中文" was corrupted: its UTF-8 bytes are 0xE4 0xB8 0xAD 0xE6 0x96 0x87, and when the browser decodes those six bytes as GB2312 it gets three different Chinese characters (each GBK character occupies two bytes). Passing those three characters into encodeURI yields %E6%B6%93%EE%85%9F%E6%9E%83. So encodeURI always uses UTF-8 and is not affected by the page character set.

Other issues related to Url encoding
Different browsers behave differently when handling URLs that contain Chinese characters. In IE, for example, if the advanced setting "Always send URLs as UTF-8" is checked, the Chinese characters in the path portion of the URL are encoded as UTF-8 before being sent to the server, while the Chinese characters in the query parameters are URL-encoded using the system default character set. For maximum interoperability, it is recommended to explicitly URL-encode every component placed in a URL with a specific character set, rather than relying on the browser's default behavior.

In addition, many HTTP monitoring tools and browser address bars automatically decode the URL (using the UTF-8 character set) before displaying it, which is why the address bar shows Chinese characters when you search for Chinese on Google in Firefox. The original URL sent to the server is still encoded; you can verify this by reading location.href with JavaScript. Don't be fooled by these decoded displays when studying URL codecs.

