Percent-encode percent code, encodeuri percent

Source: Internet
Author: User
Tags number sign printable characters

Original article address: http://www.imkevinyang.com/2009/08/detailed explanation of URL encoding in cript.html

 

Summary

  • URI (Uniform Resource Identifier) encoding and decoding
  • Why is encoding needed?
  • Which characters need to be encoded?
  • How to encode

 

Prerequisites

  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment

The above is a typical URL format. Because a URL is a type of URI, the URL encoding mentioned below actually refers to URI encoding.
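As a rough illustration of the five components, the sketch below splits the example URI with `urlsplit` from Python's standard `urllib.parse` module. Python is only an assumption made for these examples; the article itself does not prescribe a language.

```python
from urllib.parse import urlsplit

# The example URI from the diagram above.
parts = urlsplit("foo://example.com:8042/over/there?name=ferret#nose")

print(parts.scheme)    # foo
print(parts.netloc)    # example.com:8042  (the authority)
print(parts.path)      # /over/there
print(parts.query)     # name=ferret
print(parts.fragment)  # nose
```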

 

Why do I need URL encoding?

When something has to be encoded, it usually means that it is not suitable for transmission in its raw form. The reasons vary: the content may be too large, for example, or it may contain private data.

URL encoding is needed because some characters in a URL can be ambiguous. For example, the query string passes parameters as key-value pairs (key=value) separated by the & symbol, as in "/s?q=abc&ie=utf-8". If a value itself contains '=' or '&', the server that receives the URL will inevitably parse it incorrectly. The ambiguous '=' or '&' must therefore be escaped, that is, encoded.
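The ambiguity can be seen in a small sketch. The value `a&b=c` is a made-up example, and Python's `urllib.parse` stands in for whatever parser the server actually uses.

```python
from urllib.parse import parse_qs, quote

value = "a&b=c"  # hypothetical single value containing '&' and '='

# Naive concatenation: the parser sees an extra parameter 'b'.
naive = "q=" + value + "&ie=UTF-8"
print(parse_qs(naive))    # {'q': ['a'], 'b': ['c'], 'ie': ['UTF-8']}

# Percent-encoding the value first keeps it in one piece.
escaped = "q=" + quote(value, safe="") + "&ie=UTF-8"
print(escaped)            # q=a%26b%3Dc&ie=UTF-8
print(parse_qs(escaped))  # {'q': ['a&b=c'], 'ie': ['UTF-8']}
```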

In addition, URLs are written in ASCII rather than Unicode, so a URL cannot directly contain non-ASCII characters such as Chinese characters. Otherwise, problems can arise when the client browser and the server support different character sets.

The principle of URL encoding is to use safe characters (printable characters with no special purpose or meaning) to represent unsafe ones.

 

Which characters need to be encoded?

RFC 3986 stipulates that a URL may contain only English letters (a-z, A-Z), digits (0-9), the four special characters - _ . ~, and the reserved characters.

RFC 3986 gives detailed recommendations on URL encoding and decoding. It points out which characters must be encoded so that the semantics of the URL do not change, and explains why.

Only printable characters are allowed in a URL. Bytes 00-1F and 7F in US-ASCII are control characters and cannot appear directly in a URL. Bytes 80-FF (as used by ISO-8859-1, for example) are not allowed either, because they fall outside the byte range defined by US-ASCII.

Reserved characters. A URL can be divided into several components, such as the protocol (scheme), host, and path. Some characters (: / ? # [ ] @) are used to separate these components: ':' separates the protocol from the host, '/' separates the host from the path, and '?' separates the path from the query parameters. Other characters (! $ & ' ( ) * + , ; =) act as delimiters inside a component: '=' joins a key to its value in the query parameters, and '&' separates multiple key-value pairs. When ordinary data inside a component contains any of these characters, they must be encoded. RFC 3986 specifies the following characters as reserved characters:

! * ' ( ) ; : @ & = + $ , / ? # [ ]
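The character classes described above can be written down as a minimal sketch. The names `UNRESERVED`, `GEN_DELIMS`, `SUB_DELIMS`, and `classify` are hypothetical helpers introduced here for illustration; they are not part of any library.

```python
import string

# Character classes as described in the text (RFC 3986).
UNRESERVED = set(string.ascii_letters + string.digits + "-_.~")
GEN_DELIMS = set(":/?#[]@")       # separate the components of a URL
SUB_DELIMS = set("!$&'()*+,;=")   # delimiters inside a component

def classify(ch: str) -> str:
    """Return the class of a single character."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in GEN_DELIMS or ch in SUB_DELIMS:
        return "reserved"
    return "must be encoded"

for ch in ["a", "~", "?", "&", " ", "\u4e2d"]:
    print(repr(ch), classify(ch))
```

Reserved characters are a special case: they may appear literally when used as delimiters, but must be encoded when they are part of the data.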

 

Unsafe characters. Some other characters can confuse a parser when they are placed directly in a URL. They are considered unsafe for a variety of reasons:

Character                  Reason
Space                      Significant spaces may disappear and insignificant spaces may be introduced when a URL is transmitted, typeset, or handled by text-processing programs.
Quotation marks and < >    Quotation marks and angle brackets are commonly used to delimit URLs in ordinary text.
#                          Used to mark a bookmark or anchor.
%                          The percent sign is the special character used to encode unsafe characters, so it must itself be encoded.
{ } | \ ^ [ ] ` ~          Some gateways or transport proxies tamper with these characters.
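For reference, the sketch below prints the percent-encoded form of several of the unsafe characters listed above, using `quote` from Python's standard library with `safe=""` so that nothing is exempted. The list `unsafe` is just a sample chosen for this example.

```python
from urllib.parse import quote

# A sample of the unsafe characters discussed above.
unsafe = [" ", '"', "<", ">", "#", "%", "{", "}", "|", "\\", "^", "`"]

# Map each character to its percent-encoded form.
encoded = {ch: quote(ch, safe="") for ch in unsafe}

for ch, enc in encoded.items():
    print(repr(ch), "->", enc)
```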

 

Note: for characters that are valid in a URL, the encoded and unencoded forms are equivalent. For the characters described above, however, leaving them unencoded can change the meaning of the URL. In short, only English letters, digits, the special characters $ - _ . + ! * ' ( ) and the reserved characters may appear unencoded in a URL. All other characters must be encoded.

For historical reasons, however, some nonstandard encoding implementations remain. For example, although RFC 3986 states that the tilde (~) does not need to be URL-encoded, many old gateways and transport proxies still encode it.
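One way to cope with such differences is to decode percent-encoded unreserved characters before comparing URLs. The sketch below is a simplified illustration under that assumption; `decode_unreserved` is a hypothetical helper, not a library function, and it leaves every other escape sequence in place.

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def decode_unreserved(url: str) -> str:
    """Decode %XX only when it stands for an unreserved character."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        # Keep other escapes, but normalize their hex digits to uppercase.
        return ch if ch in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, url)

print(decode_unreserved("/%7Euser/a%2fb"))  # /~user/a%2Fb
```

The encoded slash stays encoded because '/' is a reserved character, and decoding it would change how the path is split.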

 

How do I encode invalid characters in a URL?

URL encoding is also known as percent-encoding because the scheme is very simple: a percent sign (%) followed by two hexadecimal digits (0-9, A-F) that give the value of one byte. The default character set for URL encoding is US-ASCII. For example, 'a' is byte 0x61 in US-ASCII, so its URL-encoded form is %61. Entering http://g.cn/search?q=%61%62%63 in the address bar is therefore equivalent to searching for abc on Google. Likewise, the @ symbol is byte 0x40 in the ASCII character set, so URL encoding yields %40.

URL encoding of common characters:

Character   !    *    "    '    (    )    ;    :    @    &
Encoding    %21  %2A  %22  %27  %28  %29  %3B  %3A  %40  %26

Character   =    +    $    ,    /    ?    %    #    [    ]
Encoding    %3D  %2B  %24  %2C  %2F  %3F  %25  %23  %5B  %5D
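The rule of a percent sign plus two hexadecimal digits per byte is small enough to sketch by hand. `percent_encode_byte` is a hypothetical helper written for this example; `unquote` is the standard-library decoder.

```python
from urllib.parse import unquote

def percent_encode_byte(b: int) -> str:
    """Return the %XX form of a single byte value (0-255)."""
    return "%{:02X}".format(b)

print(percent_encode_byte(ord("a")))  # %61
print(percent_encode_byte(ord("@")))  # %40

# Decoding reverses the process.
print(unquote("%61%62%63"))           # abc
```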

 

For non-ASCII characters, the super set of the ASCII character set must be used for encoding to obtain the corresponding bytes, and then the percent code is executed for each byte. For Unicode characters, we recommend that you use UTF-8 to encode them to obtain the corresponding bytes, and then perform percent encoding for each byte. For example: "Chinese" using the UTF-8 character set to get the byte 0xE4 0xB8 0xAD 0xE6 0x96 0x87, after URL encoding, "% E4 % B8 % AD % E6 % 96% 87" is obtained ".

  If a byte corresponds to a non-reserved character in the ASCII character set, this Byte does not need to be expressed by a percent sign. For example: "URL encoding", the bytes produced by UTF-8 encoding are 0x55 0x72 0x6C 0xE7 0xBC 0x96 0xE7 0xA0 0x81, because the first three bytes correspond to the non-reserved characters "URL" in ASCII, these three bytes can be represented by non-reserved characters "URL. The final URL encoding can be simplified to "URL % E7 % BC % 96% E7 % A0 % 81". Of course, you can also use "% 55% 72% 6C % E7 % BC % 96% E7 % A0 % 81.

Due to historical reasons, some URL encoding implementations do not fully follow this principle.

 

Other issues related to URL encoding

Different browsers handle URLs that contain Chinese characters differently. In IE, for example, if the advanced setting "always send URL with UTF-8" is checked, the Chinese part of the URL path is URL-encoded using UTF-8 before being sent to the server, while the Chinese part of the query parameters is URL-encoded using the default character set. To ensure maximum interoperability, explicitly choose a character set and URL-encode every component yourself before putting it into a URL, rather than relying on the browser's default behavior.

In addition, many HTTP monitoring tools and browser address bars automatically decode the URL once (using the UTF-8 character set) before displaying it. This is why the address bar shows Chinese characters when you search for Chinese text on Google in Firefox, even though the URL actually sent to the server is encoded. You can see the raw URL by reading location.href with JavaScript. Do not be misled by this when studying URL encoding and decoding.
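The effect of the character set can be sketched as follows. GBK is used here only as an example of a non-UTF-8 default; which character set a given browser actually uses is an assumption that depends on the system.

```python
from urllib.parse import quote, unquote

text = "中文"

# The same text gives different percent-encoded forms per character set.
utf8_form = quote(text, encoding="utf-8")
gbk_form = quote(text, encoding="gbk")
print(utf8_form)  # %E4%B8%AD%E6%96%87
print(gbk_form)   # %D6%D0%CE%C4

# Decoding with the wrong character set does not recover the text.
right = unquote(gbk_form, encoding="gbk")
wrong = unquote(gbk_form, encoding="utf-8")
print(right == text)  # True
print(wrong == text)  # False
```

Passing `encoding` explicitly on both sides is the point of the sketch: sender and receiver must agree on the character set.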
