URL Encoding Knowledge: Excerpt Memo

Source: Internet
Author: User
Tags: control characters, HTML form, printable characters, RFC

Web tool: http://www.107000.com/T-UrlEncode/

References:

Wikipedia: http://zh.wikipedia.org/zh/%E7%99%BE%E5%88%86%E5%8F%B7%E7%BC%96%E7%A0%81

Baidu Encyclopedia: http://baike.baidu.com/view/204662.htm

cnblogs (Blog Park) article: http://kb.cnblogs.com/page/133765/

Overview

Percent-encoding, also known as URL encoding, is a mechanism for encoding information in a Uniform Resource Locator (URL) under certain circumstances. In fact, it applies more generally to Uniform Resource Identifiers (URIs). It is also used to prepare data of the "application/x-www-form-urlencoded" media type, which is commonly used to submit HTML form data in HTTP requests.

Usually, if something needs to be encoded, it is not suitable for transmission in its raw form. There are various reasons for this, such as the data being too large or containing private information. For URLs, the reason is that some characters would make the URL ambiguous.

The characters allowed in a URI are either reserved or unreserved. Reserved characters are those that have a special meaning. For example, the slash character separates the different parts of a URL (or, more generally, a URI). Unreserved characters have no such special meaning. Percent-encoding represents a reserved character as a special sequence of characters. The details vary slightly depending on the revision of the URI and URI scheme specifications.


RFC 3986 Section 2.2 reserved characters (January 2005)

! * ' ( ) ; : @ & = + $ , / ? # [ ]

RFC 3986 Section 2.3 unreserved characters (January 2005)

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

a b c d e f g h i j k l m n o p q r s t u v w x y z

0 1 2 3 4 5 6 7 8 9 - _ . ~

All other characters in a URI must be percent-encoded.
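The two character sets listed above can be written down as a small lookup. The following is only a sketch: the function name classify and the three labels are made up for illustration, and the sets follow RFC 3986 Sections 2.2 and 2.3.

```python
import string

# Unreserved characters (RFC 3986 Section 2.3): letters, digits, and - _ . ~
UNRESERVED = set(string.ascii_letters + string.digits + "-_.~")

# Reserved characters (RFC 3986 Section 2.2).
RESERVED = set("!*'();:@&=+$,/?#[]")


def classify(ch: str) -> str:
    """Return a label for one character (hypothetical helper for illustration)."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in RESERVED:
        return "reserved"
    return "must be percent-encoded"


for ch in ["a", "~", "/", "&", " ", "%"]:
    print(repr(ch), classify(ch))
```

Note that the percent sign itself falls into the last group, because a literal % would otherwise be read as the start of an escape sequence.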

The principle of URL encoding is to use safe characters (printable characters with no special purpose or special meaning) to represent unsafe characters.

For example, a URL query string passes parameters as key=value pairs separated by the & symbol, such as /s?q=abc&ie=utf-8. If a value contains = or &, the server receiving the URL will parse it incorrectly, so the ambiguous & and = symbols must be escaped, that is, encoded.
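As a minimal sketch of this escaping, Python's standard urllib.parse module can encode a value that itself contains & and =. The parameter value below is made up for illustration; the key names q and ie come from the example above.

```python
from urllib.parse import parse_qs, quote, urlencode

# Hypothetical parameter value that itself contains '=' and '&'.
value = "a=1&b=2"

# quote() with safe="" percent-encodes everything except unreserved characters.
encoded_value = quote(value, safe="")

# urlencode() builds a whole query string, escaping each key and value.
query = urlencode({"q": value, "ie": "utf-8"})

# Parsing the query string recovers the original value unambiguously.
decoded = parse_qs(query)

print(encoded_value)    # a%3D1%26b%3D2
print("/s?" + query)    # /s?q=a%3D1%26b%3D2&ie=utf-8
print(decoded["q"][0])  # a=1&b=2
```

Because the = and & inside the value are written as %3D and %26, the only literal & and = left in the query string are the real delimiters.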

Another example: URLs are written in ASCII rather than Unicode, which means a URL cannot directly contain any non-ASCII characters, such as Chinese. Otherwise, Chinese characters may cause problems when the client browser and the server support different character sets.

References:

ASCII, Wikipedia: http://zh.wikipedia.org/zh-cn/ASCII

ASCII, Baidu Encyclopedia: http://baike.baidu.com/view/15482.htm

Unicode, Wikipedia: http://zh.wikipedia.org/zh-cn/Unicode

Unicode, Baidu Encyclopedia: http://baike.baidu.com/view/40801.htm

UTF-8, Wikipedia: http://zh.wikipedia.org/zh-cn/UTF-8

UTF-8, Baidu Encyclopedia: http://baike.baidu.com/view/25412.htm

Which characters need to be encoded

RFC 3986 stipulates that a URL may contain only English letters (a-z, A-Z), digits (0-9), the four special characters - _ . ~, and the reserved characters. RFC 3986 gives detailed guidance on encoding and decoding URLs, specifying which characters must be encoded so that the semantics of the URL do not change, and explaining why these characters need to be encoded.

Characters with no printable representation in US-ASCII: only printable characters are allowed in a URL. The bytes 00-1F and 7F in US-ASCII all represent control characters, which cannot appear directly in a URL. Bytes 80-FF (ISO-8859-1) are outside the range defined by US-ASCII and therefore cannot be placed in a URL either.

Reserved characters: a URL can be divided into several components, such as the protocol, host, and path. Some characters (: / ? # [ ] @) are used to separate the components. For example, the colon separates the protocol from the host, / separates the host from the path, and ? separates the path from the query parameters. Other characters (! $ & ' ( ) * + , ; =) are used as delimiters within a component; for example, = joins the key and value of a query parameter, and & separates multiple key-value pairs. When ordinary data inside a component contains these special characters, they need to be encoded.
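To see how these delimiters split a URL into components, here is a small sketch using urlsplit from Python's standard library. The URL itself is a made-up example.

```python
from urllib.parse import urlsplit

# Hypothetical URL chosen so that every component is present.
url = "http://example.com:8080/path/to/page?key=value&x=1#section"

parts = urlsplit(url)

print(parts.scheme)    # http           (ends at the first ':')
print(parts.netloc)    # example.com:8080
print(parts.path)      # /path/to/page  (starts at the first '/' after the host)
print(parts.query)     # key=value&x=1  (follows '?')
print(parts.fragment)  # section        (follows '#')
```

If the data inside a component contained one of these delimiters unencoded, the split points would move, which is why such characters must be percent-encoded when they are meant as data.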

The following characters are reserved in RFC 3986: ! * ' ( ) ; : @ & = + $ , / ? # [ ]

Unsafe characters: some other characters may cause ambiguity in a parser when placed directly in a URL. They are considered unsafe for a variety of reasons:

Spaces: while a URL is being transmitted, typeset by a user, or handled by a text-processing program, insignificant spaces may be introduced or meaningful spaces removed.

Quotation marks and < >: quotation marks and angle brackets are commonly used to delimit URLs in plain text.

#: typically used to indicate a bookmark or anchor.

%: the percent sign itself is the special character used to encode unsafe characters, so it must itself be encoded.

{ } | \ ^ [ ] ` ~: some gateways or transport agents are known to modify these characters.

Note that for characters that are legal in a URL, the encoded and unencoded forms are equivalent, but the characters mentioned above may change the semantics of the URL if they are left unencoded. Therefore, only ordinary English letters and digits, the special characters $-_.+!*'(), and reserved characters may appear in a URL without encoding. All other characters must be encoded before they appear in a URL.

However, for historical reasons, some non-standard encoding implementations still exist. For example, although RFC 3986 stipulates that the tilde (~) does not need to be URL-encoded, many older gateways and transport agents still encode it.

How to URL-encode illegal characters

URL encoding is commonly called percent-encoding because the scheme is very simple: a percent sign (%) followed by two hexadecimal digits (0123456789ABCDEF) represents one byte. The default character set for URL encoding is US-ASCII. For example, the byte for the letter a in US-ASCII is 0x61, so its URL encoding is %61; entering http://g.cn/search?q=%61%62%63 in the address bar is equivalent to searching for abc on Google. Likewise, the byte for the @ symbol in the ASCII character set is 0x40, so it is URL-encoded as %40.
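The byte-to-%XX rule can be sketched in a few lines of Python. The helper name percent_encode_byte is made up for illustration; unquote is the standard-library decoder.

```python
from urllib.parse import unquote


def percent_encode_byte(byte: int) -> str:
    """Encode one byte as '%' plus two hexadecimal digits (hypothetical helper)."""
    return "%{:02X}".format(byte)


# 'a', 'b', 'c' are the bytes 0x61, 0x62, 0x63 in US-ASCII.
encoded_abc = "".join(percent_encode_byte(b) for b in "abc".encode("ascii"))

# '@' is the byte 0x40 in US-ASCII.
encoded_at = percent_encode_byte(ord("@"))

print(encoded_abc)           # %61%62%63
print(encoded_at)            # %40
print(unquote(encoded_abc))  # abc
```

Decoding %61%62%63 gives back abc, which is why the two forms of the search URL are equivalent.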

For non-ASCII characters, a superset of the ASCII character set is used to obtain the corresponding bytes, and each byte is then percent-encoded. For Unicode characters, the RFC recommends encoding them as UTF-8 and then percent-encoding each byte. For example, the word "中文" (Chinese) in UTF-8 is the byte sequence 0xE4 0xB8 0xAD 0xE6 0x96 0x87, which URL-encodes to "%e4%b8%ad%e6%96%87".
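The two steps (UTF-8 bytes first, then percent-encoding each byte) can be checked with a short sketch, assuming the word in question is "中文". Python's quote uses UTF-8 by default and emits uppercase hexadecimal digits; the digits are case-insensitive, so the result is equivalent to the lowercase form shown above.

```python
from urllib.parse import quote, unquote

text = "中文"

# Step 1: encode the characters as UTF-8 bytes.
utf8_bytes = text.encode("utf-8")
hex_bytes = " ".join(f"{b:02X}" for b in utf8_bytes)

# Step 2: percent-encode every byte (quote() does both steps at once).
encoded = quote(text)

print(hex_bytes)  # E4 B8 AD E6 96 87
print(encoded)    # %E4%B8%AD%E6%96%87

# Decoding accepts lowercase and uppercase hex digits alike.
print(unquote("%e4%b8%ad%e6%96%87") == text)  # True
```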

If a byte corresponds to a non-reserved character in the ASCII character set, this byte does not need to be represented by a percent sign. For example, "URL encoding", the bytes obtained using UTF-8 encoding is 0x55 0x72 0x6c 0xE7 0xBC 0x96 0xE7 0xA0 0x81, because the first three bytes correspond to the non-reserved character "url" in ASCII, so these three bytes can be used non-reserved character "url" Said. The final URL encoding can be simplified to "url%e7%bc%96%e7%a0%81", of course, if you use "%55%72%6c%e7%bc%96%e7%a0%81" is also possible.

For historical reasons, there are some URL encoding implementations that do not fully follow this principle.

See Blog Park Blog http://kb.cnblogs.com/page/133765/

<<<<<<<<<<<<<<<<<<>>>>>>>>> >>>>>>>>>>

Note: when I used the tool at http://www.107000.com/T-Utf8/, I found that one Chinese character was converted to two bytes (4 hexadecimal digits) rather than three. Saving the text in Notepad and inspecting the files with the HxD tool verifies what these web tools actually convert to.

For example, for the word "中文" (the leading bytes EF BB BF, FF FE, and FE FF below are file signature headers, which were struck through in the original; see the File Signatures Table at http://www.garykessler.net/library/file_sigs.html):

Web tool: &#x4E2D;&#x6587;

Notepad UTF-8: EF BB BF E4 B8 AD E6

Notepad Unicode: FF FE 2D 4E

Notepad Unicode Big Endian: FE FF 4E 2D

Actual URL encoding: %e4%b8%ad%e6%96%87

As this shows, the web tool uses the Unicode big-endian values, whereas actual URL encoding uses UTF-8.
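The comparison can be reproduced without Notepad or a hex editor. The sketch below assumes the word is "中文" and that Notepad's "Unicode" and "Unicode Big Endian" options correspond to UTF-16 little-endian and big-endian with a byte order mark; the helper hex_dump is made up for illustration.

```python
import codecs
from urllib.parse import quote

text = "中文"


def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs (hypothetical helper)."""
    return " ".join(f"{b:02X}" for b in data)


# HTML numeric character references built from the Unicode code points.
ncr = "".join(f"&#x{ord(ch):04X};" for ch in text)

# File contents as an editor would save them, byte order mark included.
utf8_file = codecs.BOM_UTF8 + text.encode("utf-8")
utf16_le_file = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
utf16_be_file = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

print(ncr)                      # &#x4E2D;&#x6587;
print(hex_dump(utf8_file))      # EF BB BF E4 B8 AD E6 96 87
print(hex_dump(utf16_le_file))  # FF FE 2D 4E 87 65
print(hex_dump(utf16_be_file))  # FE FF 4E 2D 65 87
print(quote(text))              # %E4%B8%AD%E6%96%87
```

The code point values 4E2D and 6587 appear byte for byte in the big-endian dump, while the percent-encoded form matches the UTF-8 bytes after the byte order mark.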

For little Endian and Big Endian, refer to the previous blog text:

When data is stored, each address holds one byte. If a value needs more than one byte, there are two possible storage orders: the high address holds the high-order byte and the low address holds the low-order byte (little-endian, convenient for computer processing); or the high address holds the low-order byte and the low address holds the high-order byte (big-endian, closer to how people usually write numbers). BMP files are little-endian: high addresses hold the high-order bytes and low addresses hold the low-order bytes. Two reference links:

http://blog.csdn.net/hackbuteer1/article/details/7722667

http://www.cnblogs.com/TsuiLei/archive/2008/10/29/1322504.html
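The two byte orders can be illustrated with Python's built-in int.to_bytes. The value 0x4E2D (the code point of "中") is chosen only because it appears earlier in this memo.

```python
value = 0x4E2D  # two-byte value: high-order byte 0x4E, low-order byte 0x2D

# Little-endian: the low-order byte is stored first (at the lower address).
little = value.to_bytes(2, "little")

# Big-endian: the high-order byte is stored first (at the lower address).
big = value.to_bytes(2, "big")

print(" ".join(f"{b:02X}" for b in little))  # 2D 4E
print(" ".join(f"{b:02X}" for b in big))     # 4E 2D

# Reading the bytes back with the matching byte order restores the value.
print(int.from_bytes(little, "little") == value)  # True
print(int.from_bytes(big, "big") == value)        # True
```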

<<<<<<<<<<<<<<<<<<>>>>>>>>> >>>>>>>>>>

Appendix: Uniform Resource Identifier (URI), URL, and URN

http://zh.wikipedia.org/wiki/%E7%BB%9F%E4%B8%80%E8%B5%84%E6%BA%90%E6%A0%87%E5%BF%97%E7%AC%A6

In computing, a Uniform Resource Identifier (URI) is a string that identifies the name of an Internet resource. This identification allows users to interact with resources on a network (typically the World Wide Web) through specific protocols. A URI is defined by a scheme that determines its syntax and associated protocol.

Relationship to URLs and URNs

Figure: URL scheme classification diagram

URL (locator) and URN (name) schemes are subclasses of URI: a URI can be a URL, a URN, or both. Technically, URLs and URNs are both resource identifiers; however, it is often impossible to assign a scheme to just one of the two categories: all URIs can be treated as names, and some schemes embody aspects of both.

A URI can be considered a locator (URL), a name (URN), or both. A Uniform Resource Name (URN) is like a person's name, whereas a Uniform Resource Locator (URL) is like a person's address. In other words, a URN defines the identity of something, while a URL provides a way to find it.
