1. Basic concepts of URL encoding
URLs can only be sent over the Internet using the US-ASCII character set. Because URLs often contain characters outside the ASCII set, they must be converted to a valid ASCII form. URL encoding replaces each byte of such a character with a "%" followed by a two-digit hexadecimal number. URLs cannot contain spaces; URL encoding usually replaces a space with "+" (in form-style query strings) or with "%20". In short, URL encoding represents non-US-ASCII characters, and the special characters within US-ASCII, by the bytes of a chosen character set encoding. For example, the Chinese character "你", when encoded with UTF-8, appears in the URL as %E4%BD%A0; when encoded with GBK, it appears as %C4%E3.
RFC 3986 stipulates that only English letters (A-Z, a-z), digits (0-9), the 4 special characters "-", ".", "_", "~", and the reserved characters are allowed in a URL. However, for historical reasons, some non-standard encoding implementations remain. For example, although RFC 3986 specifies that the tilde "~" does not need to be URL-encoded, many older gateways and transport agents still encode it. Characters in a URL fall into the following 4 categories:
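The dependence on the character set can be checked with a minimal Java sketch. The class name CharsetEncodingDemo and its helper method are illustrative only, and the GBK branch assumes the running JDK ships that charset, so it is guarded:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.Charset;

// Hypothetical demo class: encodes the same character with two charsets.
class CharsetEncodingDemo {

    // Percent-encodes text with the named charset and wraps the checked exception.
    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String ni = "\u4f60"; // the character 你, written as an escape to avoid source-encoding issues
        System.out.println(encode(ni, "UTF-8")); // %E4%BD%A0
        if (Charset.isSupported("GBK")) {        // GBK may be missing on a reduced JDK
            System.out.println(encode(ni, "GBK")); // %C4%E3
        }
    }
}
```

The same character yields three encoded bytes under UTF-8 and two under GBK, which is why both sides of a connection must agree on the charset.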
1. Literal (unreserved) characters. English letters, Arabic numerals, and the 4 special characters can appear directly anywhere in a URL without encoding, because they have no special meaning. For historical reasons, however, it is safer to encode "~" rather than place it in a URL directly, that is, to exclude "~" from the literal characters.
2. Reserved characters. These usually have a special meaning and a fixed role in the URL. They do not need to be URL-encoded when used with their special meaning, but must be encoded when they appear as ordinary data.
3. Unsafe characters. These characters have no special meaning within the URL itself, but placing them directly in a URL may confuse the software that parses or transmits it, so they also need to be URL-encoded.
4. Non-US-ASCII characters. Characters from other languages must be converted to US-ASCII characters according to some character set. In URLs they are generally encoded with UTF-8.
2. URL character classification
Literal characters:
These have no special meaning in a URL and appear only as ordinary text. RFC 3986 calls them unreserved characters and describes them as follows:
These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
Reserved characters:
A URL can be divided into several components: protocol (scheme), host, path, and so on. Some characters (: / ? # [ ] @) are used to separate the components. For example, the colon separates the protocol from the host, "/" separates the host from the path, "?" separates the path from the query parameters, and so on. Other characters (! $ & ' ( ) * + , ; =) delimit parts within a component; for example, "=" joins the key and value of a query parameter, and "&" separates multiple key-value pairs. When ordinary data in a component contains these special characters, they must be encoded to prevent ambiguity in the URL.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
In summary, RFC 3986 specifies the following reserved characters: ! * ' ( ) ; : @ & = + $ , / ? # [ ]
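The grammar above translates almost directly into code. The following is a small sketch, not part of any library; the class name UriCharClass and the label "other" for everything outside the two RFC 3986 sets are assumptions made for illustration:

```java
// Hypothetical helper that classifies one character by the RFC 3986 sets quoted above.
class UriCharClass {
    static final String GEN_DELIMS = ":/?#[]@";
    static final String SUB_DELIMS = "!$&'()*+,;=";

    // unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" (ASCII only)
    static boolean isUnreserved(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9') || "-._~".indexOf(c) >= 0;
    }

    // reserved = gen-delims / sub-delims
    static boolean isReserved(char c) {
        return GEN_DELIMS.indexOf(c) >= 0 || SUB_DELIMS.indexOf(c) >= 0;
    }

    // Everything else (spaces, quotes, non-ASCII, ...) must always be percent-encoded.
    static String classify(char c) {
        if (isUnreserved(c)) return "unreserved";
        if (isReserved(c)) return "reserved";
        return "other";
    }

    public static void main(String[] args) {
        for (char c : new char[] {'a', '~', '&', '/', ' ', '"', '\u4f60'}) {
            System.out.println("'" + c + "' -> " + classify(c));
        }
    }
}
```

A character classified as "reserved" still needs encoding whenever it appears as data rather than as a delimiter; the sketch only reports which set it belongs to.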
Unsafe characters:
When placed directly in a URL, these characters may confuse the parser. For example, double quotation marks (") are used in HTML tags to delimit URL attribute values. If you include a double quote directly in the URL, the browser may misparse it. You should therefore use the encoded form %22 of the double quote to avoid any conflict. Reserved characters used as ordinary data and the other unsafe characters should likewise always be encoded.
The unsafe characters are as follows: < > " # % { } | \ ^ ~ [ ] ` space
A. Spaces: while a URL is being transmitted, typeset by a user, or handled by a text-processing program, insignificant spaces may be introduced or significant spaces removed.
B. Quotation marks and < >: quotation marks and angle brackets are commonly used to delimit URLs in plain text.
C. #: typically used to mark a bookmark or anchor.
D. %: the percent sign is itself the special character used to encode other characters, so it must be encoded.
E. { } | \ ^ [ ] ` ~: some gateways or transport agents are known to modify these characters.
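As an illustration of why such characters must be encoded when they are data, the sketch below encodes a query value that contains reserved and unsafe characters before appending it to a URL. The class name QueryValueDemo, the host example.com and the parameter name q are placeholders, not taken from the original text:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo: a query value containing &, =, quotes, spaces and %.
class QueryValueDemo {

    // Encodes one query value as UTF-8 so its characters cannot be read as URL syntax.
    static String encodeValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String value = "a&b=c \"x\" 100%";
        // Without encoding, "&" and "=" would look like an extra key-value pair.
        String url = "http://example.com/search?q=" + encodeValue(value);
        System.out.println(url);
        // http://example.com/search?q=a%26b%3Dc+%22x%22+100%25
    }
}
```

Only the value is encoded; the "?" and "=" that form the structure of the URL are left as they are.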
3. URL encoding/decoding on the Java side
The JDK has the built-in classes URLEncoder and URLDecoder for encoding and decoding on the Java side; see their Javadoc for usage. The encoding and decoding rules are as follows:
1. The alphanumeric characters "a" to "z", "A" to "Z" and "0" to "9" remain unchanged.
2. The special characters ".", "-", "*" and "_" remain unchanged.
3. The space character " " is converted to a plus sign "+".
4. All other characters are unsafe and are first converted to one or more bytes using some encoding scheme. Each byte is then represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme is UTF-8.
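The four rules can be observed with a short round trip. This is a minimal sketch; the class name JdkUrlCodecDemo and its wrapper methods are illustrative, while URLEncoder.encode and URLDecoder.decode are the JDK calls described above:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo of the JDK encoding rules and of decoding back.
class JdkUrlCodecDemo {

    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String original = "A z-9.*_ ~";
        String encoded = encode(original);
        // Letters, digits and . - * _ are kept, spaces become "+", "~" becomes %7E.
        System.out.println(encoded);          // A+z-9.*_+%7E
        System.out.println(decode(encoded));  // A z-9.*_ ~
    }
}
```

Note that "~" is encoded here even though RFC 3986 treats it as unreserved, which is one of the historical differences mentioned earlier.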
4. URL encoding/decoding in JavaScript
The JavaScript functions that involve URL encoding are escape, encodeURI, and encodeURIComponent.
escape:
This method does not encode ASCII letters and digits, nor the following ASCII punctuation marks: * @ - _ + . / All other characters are replaced by escape sequences.
encodeURI:
This method does not encode ASCII letters and digits, nor these ASCII punctuation marks: - _ . ! ~ * ' ( )
The purpose of this method is to encode a complete URI, so encodeURI() also does not escape the following ASCII punctuation marks, which have a special meaning in a URI: ; / ? : @ & = + $ , #
encodeURIComponent:
This method does not encode ASCII letters and digits, nor these ASCII punctuation marks: - _ . ! ~ * ' ( ) Other characters (such as ; / ? : @ & = + $ , # which are used to separate URI components) are replaced by one or more hexadecimal escape sequences.
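To make a Java server produce the same output as encodeURIComponent for ordinary text, one common workaround is to post-process the result of URLEncoder. The sketch below is an approximation under that assumption, not a JDK API; the class and method names are illustrative, and edge cases such as unpaired surrogates are not handled the way JavaScript handles them:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical approximation of JavaScript's encodeURIComponent on the Java side.
class EncodeUriComponentSketch {

    static String encodeURIComponent(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8")
                    .replace("+", "%20")   // URLEncoder writes spaces as "+"
                    .replace("%21", "!")   // restore the marks JavaScript leaves unencoded
                    .replace("%27", "'")
                    .replace("%28", "(")
                    .replace("%29", ")")
                    .replace("%7E", "~");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // \u00e9 is the letter é, written as an escape to avoid source-encoding issues.
        System.out.println(encodeURIComponent("a b&c=d/\u00e9~!'()*"));
        // a%20b%26c%3Dd%2F%C3%A9~!'()*
    }
}
```

The replacements are safe because every "%" in the URLEncoder output starts a "%xy" triple and a literal "+" in the input has already become %2B.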
5. Implement URL encoding yourself
The JDK's URLEncoder does not encode the special characters ".", "-", "*" and "_", while the JavaScript functions escape()/encodeURI()/encodeURIComponent() each leave a different set of characters unencoded, which can easily cause inconsistencies between the client and the server. To achieve uniformity, we can agree that every character other than English uppercase and lowercase letters and Arabic numerals is unsafe and must be URL-encoded. Below is a URL encoding utility class for strings, written with reference to org.apache.catalina.util.URLEncoder in the Tomcat source:
package org.apache.catalina.util;

import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.util.BitSet;

public class URLEncoder {

    protected static final char[] hexadecimal = {'0', '1', '2', '3', '4', '5', '6', '7',
            '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};

    // Array containing the safe characters set.
    protected BitSet safeCharacters = new BitSet(256);

    public URLEncoder() {
        for (char i = 'a'; i <= 'z'; i++) {
            safeCharacters.set(i);
        }
        for (char i = 'A'; i <= 'Z'; i++) {
            safeCharacters.set(i);
        }
        for (char i = '0'; i <= '9'; i++) {
            safeCharacters.set(i);
        }
    }

    public String encode(String path) throws Exception {
        // The URL-encoded string
        StringBuilder rewrittenPath = new StringBuilder(path.length());
        // Byte output stream; data is written to a byte array
        int maxBytesPerChar = 10;
        ByteArrayOutputStream byteBuff = new ByteArrayOutputStream(maxBytesPerChar);
        // Converts characters to bytes according to a specific charset
        OutputStreamWriter writer = new OutputStreamWriter(byteBuff, "UTF-8");
        for (int i = 0; i < path.length(); i++) {
            int c = path.charAt(i);
            if (safeCharacters.get(c)) {
                rewrittenPath.append((char) c);
            } else {
                writer.write((char) c);
                writer.flush();
                // Get the byte array corresponding to the character
                byte[] ba = byteBuff.toByteArray();
                for (int j = 0; j < ba.length; j++) {
                    byte toEncode = ba[j];
                    rewrittenPath.append('%');
                    // Low 4 bits of the byte
                    int low = toEncode & 0x0f;
                    // High 4 bits of the byte
                    int high = (toEncode & 0xf0) >> 4;
                    rewrittenPath.append(hexadecimal[high]);
                    rewrittenPath.append(hexadecimal[low]);
                }
                // Empty the byte buffer
                byteBuff.reset();
            }
        }
        return rewrittenPath.toString();
    }
}
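The same convention (only letters and digits are safe) can also be sketched more compactly by converting the whole string to UTF-8 bytes first. This is an alternative written for illustration, not the Tomcat-derived class above; the name StrictUrlEncoder is hypothetical:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical compact variant: every byte except ASCII letters and digits becomes %XY.
class StrictUrlEncoder {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    static String encode(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder out = new StringBuilder(bytes.length * 3);
        for (byte b : bytes) {
            int v = b & 0xFF; // treat the byte as unsigned
            boolean safe = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9');
            if (safe) {
                out.append((char) v);
            } else {
                // high 4 bits first, then low 4 bits
                out.append('%').append(HEX[v >> 4]).append(HEX[v & 0x0F]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String encoded = encode("a-b_c.d~ \u4f60"); // \u4f60 is the character 你
        System.out.println(encoded); // a%2Db%5Fc%2Ed%7E%20%E4%BD%A0
        // The output contains no literal "+", so the JDK decoder restores the input.
        System.out.println(java.net.URLDecoder.decode(encoded, "UTF-8").equals("a-b_c.d~ \u4f60"));
    }
}
```

Working on bytes is safe here because every byte of a multi-byte UTF-8 sequence is 0x80 or above and so can never be mistaken for an ASCII letter or digit.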