URL encoding is a troublesome task. RFC 3986 is the standard for URIs. Section 2 defines how characters are represented in URIs, and Section 3 divides a URI into scheme, hier-part, query, and fragment components. According to this RFC, a URI is composed from a limited set of characters consisting of digits, letters, and a few graphic symbols. Appendix A gives the ABNF grammar.
Take the URL http://www.qingbo.org/?p=230#comments as an example; it contains all four components mentioned above. This URL does not require percent-encoding, because none of its components contains reserved characters as data: apart from the delimiters, everything is a letter, a digit, or another unreserved visible ASCII character (see Section 2.3 of RFC 3986).
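To make the four components concrete, here is a minimal sketch that splits the example URL using the regular expression given in Appendix B of RFC 3986. The function name `splitUri` is made up for illustration, and the sketch only separates components; it does not validate them.

```javascript
// Regular expression from RFC 3986, Appendix B.
// Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
const URI_RE = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Hypothetical helper: split a URI string into its components.
function splitUri(uri) {
  const m = URI_RE.exec(uri); // every group is optional, so this always matches
  return {
    scheme: m[2],
    authority: m[4],
    path: m[5],
    query: m[7],
    fragment: m[9],
  };
}

const parts = splitUri("http://www.qingbo.org/?p=230#comments");
console.log(parts.scheme);    // http
console.log(parts.authority); // www.qingbo.org
console.log(parts.path);      // /
console.log(parts.query);     // p=230
console.log(parts.fragment);  // comments
```

Here the authority and path together make up the hier-part.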
Suppose we open this URL in Firefox and want to bookmark it with the del.icio.us plug-in button. The del.icio.us plug-in opens a new window and sends a request to the server using the GET method. The URL and its title are passed to the server as query parameters, and the server fills these two values into the value attributes of the corresponding input elements.
If nothing is encoded, the URL in this GET request is "http://del.icio.us/flimsy?url=http://www.qingbo.org/?p=230#comments&title=» Blog Archive » the blog looks like&noui&jump=close&v=4". This is where the problem arises. What about the part after the # sign? It would be interpreted as a fragment (an anchor on the page), yet #comments is only part of the value of the url parameter. In addition, the title contains non-ASCII characters (Chinese, in the original page title), which does not comply with the standard. Therefore, encoding is required. Percent-encode each component and each parameter value in the query separately, not the URL as a whole: if the question mark after flimsy in the del.icio.us URL were encoded, the server would not know that what follows is the query part. The correctly encoded link is too long to display here. You can copy the link address to see it (the browser seems to decode it again automatically when displaying it; click the link to see the encoded result in the address bar).
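The per-parameter encoding described above can be sketched as follows. This is an illustration, not the plug-in's actual code: the function name `buildBookmarkUrl` is made up, the parameter names are copied from the example request, and the title is a made-up stand-in for the real page title.

```javascript
// Hypothetical helper: build the bookmark request URL.
// Only the individual values are encoded; the delimiters "?", "&" and "="
// that structure the query are written literally.
function buildBookmarkUrl(user, pageUrl, pageTitle) {
  return "http://del.icio.us/" + encodeURIComponent(user) +
    "?url=" + encodeURIComponent(pageUrl) +
    "&title=" + encodeURIComponent(pageTitle) +
    "&noui&jump=close&v=4";
}

const bookmarkUrl = buildBookmarkUrl(
  "flimsy",
  "http://www.qingbo.org/?p=230#comments",
  "Blog Archive » 中文" // stand-in title containing non-ASCII characters
);
console.log(bookmarkUrl);
```

In the result, the `#` of the bookmarked page appears as `%23` and its `?` as `%3F`, so neither can be mistaken for a delimiter of the outer URL.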
URL encoding requires that the string first be converted into a UTF-8 byte sequence and then percent-encoded, as described in RFC 3986 and on the W3C website. The Firefox plug-in implements its program logic in JavaScript. JavaScript strings are UTF-16 internally, but the convenient encodeURIComponent function converts them to UTF-8 before percent-encoding, so it does the URL encoding directly. There are two related functions as well, escape and encodeURI. For a comparison of the three, refer to this article.
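As a quick illustration of how the three functions differ, the sketch below runs each on the example URL and on a single Chinese character. The variable names are made up; note that `escape` is a legacy function and does not produce UTF-8 based output.

```javascript
const sampleUrl = "http://www.qingbo.org/?p=230#comments";

// encodeURI keeps URI delimiters such as ":", "/", "?" and "#" intact,
// so a URL that is already valid comes back unchanged.
console.log(encodeURI(sampleUrl));
// http://www.qingbo.org/?p=230#comments

// encodeURIComponent also encodes the delimiters, which is what a
// parameter value needs.
console.log(encodeURIComponent(sampleUrl));
// http%3A%2F%2Fwww.qingbo.org%2F%3Fp%3D230%23comments

// Both of them percent-encode the UTF-8 bytes of non-ASCII characters.
console.log(encodeURIComponent("中")); // %E4%B8%AD

// escape emits a non-standard %uXXXX form based on UTF-16 code units.
console.log(escape("中")); // %u4E2D
```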
If there is no ready-made function, percent-encoding a UTF-8 byte sequence by hand is still fairly easy. Unreserved characters do not need to be converted. Every other byte is written as "%" followed by two hexadecimal digits (% HEXDIG HEXDIG). In the query part, a space can also be written as "+" instead of "%20", which is shorter.
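A minimal sketch of that procedure, assuming the standard `TextEncoder` is available to produce the UTF-8 bytes. The function name `percentEncode` and its `spaceAsPlus` flag are made up for illustration. Unlike `encodeURIComponent`, this version leaves only the RFC 3986 unreserved characters untouched.

```javascript
// Unreserved characters per RFC 3986, Section 2.3.
const UNRESERVED = /[A-Za-z0-9\-._~]/;

// Hypothetical helper: percent-encode every byte that is not unreserved.
// spaceAsPlus is an assumed option for the "+" convention in query strings.
function percentEncode(text, spaceAsPlus = false) {
  const bytes = new TextEncoder().encode(text); // UTF-8 byte sequence
  let out = "";
  for (const b of bytes) {
    const ch = String.fromCharCode(b);
    if (b < 0x80 && UNRESERVED.test(ch)) {
      out += ch; // unreserved: copy as-is
    } else if (spaceAsPlus && b === 0x20) {
      out += "+"; // space written as "+"
    } else {
      // "%" followed by two uppercase hex digits
      out += "%" + b.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

console.log(percentEncode("a b&c"));     // a%20b%26c
console.log(percentEncode("a b", true)); // a+b
console.log(percentEncode("中"));        // %E4%B8%AD
```

When `spaceAsPlus` is set, a literal "+" in the input is still encoded as `%2B`, so the two cannot be confused after decoding.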
A very important question is how to convert characters (non-ASCII characters, mainly Chinese characters and the like) into a UTF-8 byte sequence. Not every language is as convenient as JavaScript. For example, suppose that in C++ you have a wide string containing Chinese characters. How can this be handled? The Win32 API has an InternetCanonicalizeUrl function, but it only operates on a byte sequence and does not take Chinese encoding conversion into account. In the MSDN definition of a "standard URL", the "characters that must be encoded" do not mention wide characters, or even UTF-8. The function also requires that the string contain a scheme.
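The conversion itself is plain bit manipulation and is the same in any language. The sketch below does it by hand in JavaScript, whose strings are UTF-16, as wide strings on Windows are, so the same steps carry over to C++. The function name `utf8Bytes` is made up, and the sketch assumes well-formed input: unpaired surrogates are not handled.

```javascript
// Hypothetical helper: encode a UTF-16 string as an array of UTF-8 bytes.
function utf8Bytes(text) {
  const bytes = [];
  // for...of iterates by code point, so surrogate pairs arrive as one unit.
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp < 0x80) {
      bytes.push(cp); // 1 byte: 0xxxxxxx
    } else if (cp < 0x800) {
      bytes.push(0xC0 | (cp >> 6), // 2 bytes: 110xxxxx 10xxxxxx
                 0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      bytes.push(0xE0 | (cp >> 12), // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                 0x80 | ((cp >> 6) & 0x3F),
                 0x80 | (cp & 0x3F));
    } else {
      bytes.push(0xF0 | (cp >> 18), // 4 bytes: 11110xxx followed by three 10xxxxxx
                 0x80 | ((cp >> 12) & 0x3F),
                 0x80 | ((cp >> 6) & 0x3F),
                 0x80 | (cp & 0x3F));
    }
  }
  return bytes;
}

// U+4E2D needs three bytes.
console.log(utf8Bytes("中").map((b) => b.toString(16))); // [ 'e4', 'b8', 'ad' ]
```

Most Chinese characters fall in the three-byte range, which is why each one expands to nine characters once percent-encoded.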
If I have time, I will write another article about how to convert Chinese text to a UTF-8 byte sequence on Windows, which may be helpful to anyone who needs to encode Chinese URLs. See the article "GBK (GB2312) to UTF-8 encoding conversion".
This article is from the CSDN blog. If you reproduce it, please cite the source: http://blog.csdn.net/fanwenbo/archive/2008/04/14/2291878.aspx